Solr JSON facet range out of memory exception
Dear all Solr users/developers, Hi, I am going to use a Solr JSON facet range on a date field which is stored as long millis. Unfortunately I get a Java heap space exception no matter how much memory is assigned to the Solr Java heap! I already tested that with 2g of heap space for a Solr core with 50k documents!! Is there any performance concern regarding a JSON facet range on a long field? Sincerely, -- A.Nazemian
Re: Solr JSON facet range out of memory exception
Dear Yonik, Hi, The entire index has 50k documents, not just the faceted subset. It is just a test case right now! I used the JSON facet API; here is my query after encoding: http://10.102.1.5:8983/solr/edgeIndex/select?q=*%3A*&fq=stat_owner_id:122952&rows=0&wt=json&indent=true&facet=true&json.facet=%7bresult:%7btype:range,field:stat_date,start:146027158386,end:1460271583864,gap:1%7d%7d Sincerely, On Sun, Apr 10, 2016 at 4:56 PM, Yonik Seeley wrote: > On Sun, Apr 10, 2016 at 3:47 AM, Ali Nazemian > wrote: > > Dear all Solr users/developeres, > > Hi, > > I am going to use Solr JSON facet range on a date filed which is stored > as > > long milis. Unfortunately I got java heap space exception no matter how > > much memory assigned to Solr Java heap! I already test that with 2g heap > > space for Solr core with 50k documents!! > > You mean the entire index is 50K documents? Or do you mean the subset > of documents to be faceted? > If you're getting an OOM with the former (with a 2G heap), it sounds > like you've hit some sort of bug. > > What does your faceting command look like? > > -Yonik > -- A.Nazemian
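(For readability, the encoded json.facet parameter above decodes to roughly the following; the content is reconstructed from the URL with whitespace added:)

json.facet = {
  result: {
    type: range,
    field: stat_date,
    start: 146027158386,
    end: 1460271583864,
    gap: 1
  }
}

With gap:1 over a span of 1460271583864 - 146027158386, which is roughly 1.3 x 10^12 milliseconds, this single request describes on the order of a trillion range buckets, which could plausibly account for heap exhaustion on its own, independent of the 50k document count.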
Solr facet using gap function
Dear all, Hi, I am wondering, is there any way to introduce and add a custom function for the facet gap parameter? I already know there is some date math that can be used (such as DAY, MONTH, etc.). I want to add some functions of my own and use them as the gap in a facet range; is it possible? Sincerely, Ali.
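(As a point of reference, a minimal SolrJ sketch of the built-in date-math gap the question refers to might look like the following; the field name stat_date and the date range are illustrative assumptions, not taken from the message:)

import java.util.Date;
import org.apache.solr.client.solrj.SolrQuery;

public class DateGapFacetSketch {
    public static SolrQuery build() {
        SolrQuery query = new SolrQuery("*:*");
        query.setRows(0);
        // facet.range.gap accepts built-in date math such as +1DAY or +1MONTH;
        // the question above asks whether a custom function could be plugged in here instead.
        query.addDateRangeFacet("stat_date", new Date(0L), new Date(), "+1DAY");
        return query;
    }
}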
Searching for term sequence including blank character using regex
Dear Solr Users/Developers, Hi, I was wondering what the correct query syntax is for searching a sequence of terms with a blank character in the middle of the sequence. Suppose I am looking for a query syntax using the fq parameter. For example, suppose I want to search for all documents containing the "hello world" sequence using the fq parameter. I am not sure why fq=content:/.*hello world.*/ did not work for a tokenized field in this situation, while fq=content:/.*hello.*/ did work for the same field. Is there any fq query syntax for such a search requirement? Best regards. -- A.Nazemian
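(No reply appears in this archive; as an aside that is not from the thread: Lucene/Solr regular-expression queries are matched against individual indexed terms, so on a tokenized field a pattern containing a space such as /.*hello world.*/ cannot match, because "hello" and "world" are separate tokens, while /.*hello.*/ can match the single token "hello". A phrase filter is the usual way to require the two adjacent tokens; a minimal SolrJ sketch, using the field name from the question:)

import org.apache.solr.client.solrj.SolrQuery;

public class PhraseFilterSketch {
    public static SolrQuery build() {
        SolrQuery query = new SolrQuery("*:*");
        // Phrase filter: requires the tokens "hello" and "world" to appear
        // adjacently and in order in the tokenized content field.
        query.addFilterQuery("content:\"hello world\"");
        return query;
    }
}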
Solr re-indexing in case of store=false
Dear all, Hi, I was wondering, is it possible to re-index Solr 6.0 data when store=false? I am using Solr as a secondary datastore, and for the sake of space efficiency all the fields (except id) are set to store=false. Currently, due to some changes in the application business, the Solr schema has to change, and in order to see the effect of the schema change on old data, I have to re-index. I know that one way of re-indexing in Solr is reading data from one collection (core) and inserting it into another one, but this solution is not possible for store=false fields, and re-indexing all the data through the primary datastore is rather costly, so I would be grateful if somebody could suggest another way of re-indexing the whole data without using another datastore. Sincerely, -- A.Nazemian
Re: Solr re-indexing in case of store=false
Dear Erick, Hi, Thank you very much. About the storage overhead you are right, unless the primary datastore uses some kind of data compression, which in my case it does (I am using Cassandra as the primary datastore); I am not sure whether Solr applies any compression to stored fields. Based on your reply, it seems I have to do it the hard way, i.e. use the primary datastore to rebuild the index from scratch. Sincerely, On Sun, May 8, 2016 at 11:07 PM, Erick Erickson wrote: > bq: I would be grateful if somebody could introduce other way of > re-indexing > the whole data without using another datastore > > Not possible currently. Consider what's _in_ the index when stored="false". > The actual terms are the output of the entire analysis chain, including > stemming, stopword removal, synonym substitution etc. Since the > indexing process is lossy, you simply cannot reconstruct the original > stream from the indexed terms. > > I suppose one _could_ do this in the case of docValues only index with > the new return-values-from-docvalues functionality, but even that's lossy > because the order of returned values may not be the original insertion > order. And if that suits your needs, a pretty simple driver program would > suffice. > > To do this from indexed-only terms you'd have to somehow store the > original version of each term or store some codes indicating exactly > how to reconstruct the original steam, which very possibly would take > up as much space as if you'd just stored the values anyway. _And_ it > would burden every one else who didn't want to do this with a bloated > index. > > Best, > Erick > > On Sun, May 8, 2016 at 4:25 AM, Ali Nazemian > wrote: > > Dear all, > > Hi, > > I was wondering, is it possible to re-index Solr 6.0 data in case of > > store=false? I am using Solr as a secondary datastore, and for the sake > of > > space efficiency all the fields (except id) are considered as > store=false. > > Currently, due to some changes in application business, Solr schema > should > > change, and in order to see the effect of changing schema on old data, I > > have to do the re-index process. I know that one way of re-indexing in > > Solr is reading data from one collection (core) and inserting that to > > another one, but this solution is not possible for store=false fields, > and > > re-indexing the whole data through primary datastore is kind of costly, > so > > I would be grateful if somebody could introduce other way of re-indexing > > the whole data without using another datastore. > > > > Sincerely, > > > > -- > > A.Nazemian > -- A.Nazemian
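(Building on Erick's remark that for a docValues-only index "a pretty simple driver program would suffice", here is a rough, untested sketch of what such a driver could look like in SolrJ. The collection names, the field list, and the cursor-based paging are assumptions for illustration, not something provided in the thread; the fl fields must be docValues-backed and returnable, e.g. via useDocValuesAsStored in Solr 6:)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class DocValuesReindexDriver {
    public static void main(String[] args) throws Exception {
        HttpSolrClient source = new HttpSolrClient("http://localhost:8983/solr/oldCollection");
        HttpSolrClient target = new HttpSolrClient("http://localhost:8983/solr/newCollection");

        SolrQuery query = new SolrQuery("*:*");
        query.setFields("id", "fieldA", "fieldB");      // placeholder docValues fields
        query.setRows(1000);
        query.setSort("id", SolrQuery.ORDER.asc);       // cursorMark requires a sort on the unique key

        String cursor = CursorMarkParams.CURSOR_MARK_START;
        while (true) {
            query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
            QueryResponse rsp = source.query(query);
            for (SolrDocument doc : rsp.getResults()) {
                SolrInputDocument in = new SolrInputDocument();
                for (String name : doc.getFieldNames()) {
                    // multi-valued fields would need getFieldValues() instead
                    in.addField(name, doc.getFieldValue(name));
                }
                target.add(in);
            }
            String next = rsp.getNextCursorMark();
            if (cursor.equals(next)) break;             // no more pages
            cursor = next;
        }
        target.commit();
        source.close();
        target.close();
    }
}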
java.lang.IllegalStateException: Too many values for UnInvertedField faceting on field content
Dears, Hi, I have a collection of 1.6m documents in Solr 5.2.1. When I facet on the content field, this error appears after around 30 seconds of trying to return the results:

null:org.apache.solr.common.SolrException: Exception during facet.field: content
at org.apache.solr.request.SimpleFacets$3.call(SimpleFacets.java:632)
at org.apache.solr.request.SimpleFacets$3.call(SimpleFacets.java:617)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at org.apache.solr.request.SimpleFacets$2.execute(SimpleFacets.java:571)
at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:642)
at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:285)
at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:102)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:255)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.server.Server.handle(Server.java:497)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: Too many values for UnInvertedField faceting on field content
at org.apache.lucene.uninverting.DocTermOrds.uninvert(DocTermOrds.java:509)
at org.apache.lucene.uninverting.DocTermOrds.(DocTermOrds.java:215)
at org.apache.lucene.uninverting.DocTermOrds.(DocTermOrds.java:206)
at org.apache.lucene.uninverting.DocTermOrds.(DocTermOrds.java:199)
at org.apache.lucene.uninverting.FieldCacheImpl$DocTermOrdsCache.createValue(FieldCacheImpl.java:946)
at org.apache.lucene.uninverting.FieldCacheImpl$Cache.get(FieldCacheImpl.java:190)
at org.apache.lucene.uninverting.FieldCacheImpl.getDocTermOrds(FieldCacheImpl.java:933)
at org.apache.lucene.uninverting.UninvertingReader.getSortedSetDocValues(UninvertingReader.java:275)
at org.apache.lucene.index.FilterLeafReader.getSortedSetDocValues(FilterLeafReader.java:454)
at org.apache.lucene.index.MultiDocValues.getSortedSetValues(MultiDocValues.java:356)
at org.apache.lucene.index.SlowCompositeReaderWrapper.getSortedSetDocValues(SlowCompositeReaderWrapper.java:165)
at org.apache.solr.request.DocValuesFacets.getCounts(DocValuesFacets.java:72)
at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:490)
at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:386)
at org.apache.solr.request.SimpleFacets$3.call(SimpleFacets.java:626)
... 33 more

Here is the schema.xml related to the content field: Would you please help me to solve this problem? Best regards. -- A.Nazemian
Re: java.lang.IllegalStateException: Too many values for UnInvertedField faceting on field content
Dear Toke and Davidphilip, Hi, The fieldtype text_fa has some custom language-specific normalizer and charfilter; here is the schema.xml definition related to this field: I did try facet.method=enum and it works fine. Did you mean that applying a facet on an analyzed field is actually wrong? Best regards. On Mon, Jul 20, 2015 at 8:07 PM, Toke Eskildsen wrote: > Ali Nazemian wrote: > > I have a collection of 1.6m documents in Solr 5.2.1. > > [...] > > Caused by: java.lang.IllegalStateException: Too many values for > > UnInvertedField faceting on field content > > [...] > > > default="noval" termVectors="true" termPositions="true" > > termOffsets="true"/> > > You are hitting an internal limit in Solr. As davidphilip tells you, the > solution is docValues, but they cannot be enabled for text fields. You need > String fields, but the name of your field suggests that you need > analyzation & tokenization, which cannot be done on String fields. > > > Would you please help me to solve this problem? > > With the information we have, it does not seem to be easy to solve: It > seems like you want to facet on all terms in your index. As they need to be > String (to use docValues), you would have to do all the splitting on white > space, normalization etc. outside of Solr. > > - Toke Eskildsen > -- A.Nazemian
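(For reference, a minimal SolrJ sketch of the facet request with facet.method=enum that is reported to work above; the Solr URL and collection name are assumptions:)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class EnumFacetSketch {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1");
        SolrQuery query = new SolrQuery("*:*");
        query.setRows(0);
        query.setFacet(true);
        query.addFacetField("content");
        // enum walks the term dictionary and uses the filterCache per term,
        // instead of un-inverting the field as the default fc/fcs methods do.
        query.set("facet.method", "enum");
        QueryResponse rsp = client.query(query);
        System.out.println(rsp.getFacetField("content").getValueCount());
        client.close();
    }
}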
Re: java.lang.IllegalStateException: Too many values for UnInvertedField faceting on field content
Dear Erick, Actually, faceting on this field is not a user-facing feature. I did it to test the customized normalizer and charfilter that I use, so it is only for testing purposes. Anyway, I did some googling on this error and it seems that changing the facet method to enum works in other similar cases too. I don't know how the fcs and enum methods differ in calculating facets behind the scenes, but it seems that enum works better in my case. Best regards. On Tue, Jul 21, 2015 at 9:08 AM, Erick Erickson wrote: > This really seems like an XY problem. _Why_ are you faceting on a > tokenized field? > What are you really trying to accomplish? Because faceting on a generalized > content field that's an analyzed field is often A Bad Thing. Try going > into the > admin UI>> Schema Browser for that field, and you'll see how many unique > terms > you have in that field. Faceting on that many unique terms is rarely > useful to the > end user, so my suspicion is that you're not doing what you think you > are. Or you > have an unusual use-case. Either way, we need to understand what use-case > you're trying to support in order to respond helpfully. > > You say that using facet.enum works, this is very surprising. That method > uses > the filterCache to create a bitset for each unique term. Which is totally > incompatible with the uninverted field error you're reporting, so I > clearly don't > understand something about your setup. Are you _sure_? > > Best, > Erick > > On Mon, Jul 20, 2015 at 9:32 PM, Ali Nazemian > wrote: > > Dear Toke and Davidphilip, > > Hi, > > The fieldtype text_fa has some custom language specific normalizer and > > charfilter, here is the schema.xml value related for this field: > > positionIncrementGap="100"> > > > > > class="com.ictcert.lucene.analysis.fa.FarsiCharFilterFactory"/> > > > > > > > class="com.ictcert.lucene.analysis.fa.FarsiNormalizationFilterFactory"/> > > > words="lang/stopwords_fa.txt" /> > > > > > > > class="com.ictcert.lucene.analysis.fa.FarsiCharFilterFactory"/> > > > > > > > class="com.ictcert.lucene.analysis.fa.FarsiNormalizationFilterFactory"/> > > > words="lang/stopwords_fa.txt" /> > > > > > > > > I did try the facet.method=enum and it works fine. Did you mean that > > actually applying facet on analyzed field is wrong? > > > > Best regards. > > > > On Mon, Jul 20, 2015 at 8:07 PM, Toke Eskildsen > > wrote: > > > >> Ali Nazemian wrote: > >> > I have a collection of 1.6m documents in Solr 5.2.1. > >> > [...] > >> > Caused by: java.lang.IllegalStateException: Too many values for > >> > UnInvertedField faceting on field content > >> > [...] > >> > >> > default="noval" termVectors="true" termPositions="true" > >> > termOffsets="true"/> > >> > >> You are hitting an internal limit in Solr. As davidphilip tells you, the > >> solution is docValues, but they cannot be enabled for text fields. You > need > >> String fields, but the name of your field suggests that you need > >> analyzation & tokenization, which cannot be done on String fields. > >> > >> > Would you please help me to solve this problem? > >> > >> With the information we have, it does not seem to be easy to solve: It > >> seems like you want to facet on all terms in your index. As they need > to be > >> String (to use docValues), you would have to do all the splitting on > white > >> space, normalization etc. outside of Solr. > >> > >> - Toke Eskildsen > >> > > > > > > > > -- > > A.Nazemian > -- A.Nazemian
Re: java.lang.IllegalStateException: Too many values for UnInvertedField faceting on field content
Dear Erick, I found another thing: I checked the number of unique terms for this field using the schema browser, and it reported 1683404 terms! Does that exceed the maximum number of unique terms for the "fcs" facet method? I read somewhere it should be more than 16m; is that true?! Best regards. On Tue, Jul 21, 2015 at 10:00 AM, Ali Nazemian wrote: > Dear Erick, > > Actually faceting on this field is not a user wanted application. I did > that for the purpose of testing the customized normalizer and charfilter > which I used. Therefore it just used for the purpose of testing. Anyway I > did some googling on this error and It seems that changing facet method to > enum works in other similar cases too. I dont know the differences between > fcs and enum methods on calculating facet behind the scene, but it seems > that enum works better in my case. > > Best regards. > > On Tue, Jul 21, 2015 at 9:08 AM, Erick Erickson > wrote: > >> This really seems like an XY problem. _Why_ are you faceting on a >> tokenized field? >> What are you really trying to accomplish? Because faceting on a >> generalized >> content field that's an analyzed field is often A Bad Thing. Try going >> into the >> admin UI>> Schema Browser for that field, and you'll see how many unique >> terms >> you have in that field. Faceting on that many unique terms is rarely >> useful to the >> end user, so my suspicion is that you're not doing what you think you >> are. Or you >> have an unusual use-case. Either way, we need to understand what use-case >> you're trying to support in order to respond helpfully. >> >> You say that using facet.enum works, this is very surprising. That method >> uses >> the filterCache to create a bitset for each unique term. Which is totally >> incompatible with the uninverted field error you're reporting, so I >> clearly don't >> understand something about your setup. Are you _sure_? >> >> Best, >> Erick >> >> On Mon, Jul 20, 2015 at 9:32 PM, Ali Nazemian >> wrote: >> > Dear Toke and Davidphilip, >> > Hi, >> > The fieldtype text_fa has some custom language specific normalizer and >> > charfilter, here is the schema.xml value related for this field: >> > > positionIncrementGap="100"> >> > >> > > > class="com.ictcert.lucene.analysis.fa.FarsiCharFilterFactory"/> >> > >> > >> > > > class="com.ictcert.lucene.analysis.fa.FarsiNormalizationFilterFactory"/> >> > > > words="lang/stopwords_fa.txt" /> >> > >> > >> > > > class="com.ictcert.lucene.analysis.fa.FarsiCharFilterFactory"/> >> > >> > >> > > > class="com.ictcert.lucene.analysis.fa.FarsiNormalizationFilterFactory"/> >> > > > words="lang/stopwords_fa.txt" /> >> > >> > >> > >> > I did try the facet.method=enum and it works fine. Did you mean that >> > actually applying facet on analyzed field is wrong? >> > >> > Best regards. >> > >> > On Mon, Jul 20, 2015 at 8:07 PM, Toke Eskildsen > > >> > wrote: >> > >> >> Ali Nazemian wrote: >> >> > I have a collection of 1.6m documents in Solr 5.2.1. >> >> > [...] >> >> > Caused by: java.lang.IllegalStateException: Too many values for >> >> > UnInvertedField faceting on field content >> >> > [...] >> >> > > >> > default="noval" termVectors="true" termPositions="true" >> >> > termOffsets="true"/> >> >> >> >> You are hitting an internal limit in Solr. As davidphilip tells you, >> the >> >> solution is docValues, but they cannot be enabled for text fields. You >> need >> >> String fields, but the name of your field suggests that you need >> >> analyzation & tokenization, which cannot be done on String fields. 
>> >> >> >> > Would you please help me to solve this problem? >> >> >> >> With the information we have, it does not seem to be easy to solve: It >> >> seems like you want to facet on all terms in your index. As they need >> to be >> >> String (to use docValues), you would have to do all the splitting on >> white >> >> space, normalization etc. outside of Solr. >> >> >> >> - Toke Eskildsen >> >> >> > >> > >> > >> > -- >> > A.Nazemian >> > > > > -- > A.Nazemian > -- A.Nazemian
Optimizing Solr indexing over WAN
Dears, Hi, I know that there are lots of tips about how to make Solr indexing faster. Probably the most important client-side ones are batch indexing and multi-threaded indexing. There are other important server-side factors which I don't want to mention here. Anyway, my question is: is there any best practice for the number of client threads and the batch size over a WAN network? Since the client and servers are connected over a WAN, performance factors such as network latency and bandwidth are different from a LAN. Another thing that matters to me is the fact that document sizes may differ widely across scenarios. For example, when indexing web pages, document size might range from 1KB to 200KB. In such a case, choosing the batch size by number of documents is probably not the best way to optimize indexing performance; choosing by total batch size in KB/MB would probably be better from the network point of view. However, from the Solr side, the number of documents matters. So, to summarize my questions:

1- Is there any best practice for Solr client-side performance tuning over a WAN for indexing/re-indexing/updating? Is it different from a LAN?
2- Which matters more: the number of documents or the total size of the documents in a batch?

Best regards. -- A.Nazemian
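(No reply is included in this archive; as a hedged illustration of the client-side knobs the question is about, a ConcurrentUpdateSolrClient sketch is shown below. The URL, queue size, thread count, and batch size are placeholder values to be tuned, not recommendations from the thread:)

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class WanIndexingSketch {
    public static void main(String[] args) throws Exception {
        // queueSize controls how many batched requests are buffered locally,
        // threadCount how many parallel connections push them to the server.
        ConcurrentUpdateSolrClient client =
                new ConcurrentUpdateSolrClient("http://remote-host:8983/solr/collection1", 10, 4);

        List<SolrInputDocument> batch = new ArrayList<>();
        for (int i = 0; i < 10000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("content", "example document body " + i);
            batch.add(doc);
            if (batch.size() == 500) {      // placeholder batch size; tune per document size and latency
                client.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            client.add(batch);
        }
        client.blockUntilFinished();         // wait for the background queue to drain
        client.commit();
        client.close();
    }
}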
Re: java.lang.IllegalStateException: Too many values for UnInvertedField faceting on field content
Dear Yonik, Hi, Thanks a lot for your response. Best regards. On Tue, Jul 21, 2015 at 5:42 PM, Yonik Seeley wrote: > On Tue, Jul 21, 2015 at 3:09 AM, Ali Nazemian > wrote: > > Dear Erick, > > I found another thing, I did check the number of unique terms for this > > field using schema browser, It reported 1683404 number of terms! Does it > > exceed the maximum number of unique terms for "fcs" facet method? > > The real limit is not simple since the data is not stored in a simple > way (it's compressed). > > > I read > > somewhere it should be more than 16m does it true?! > > More like 16MB of delta-coded terms per block of documents (the index > is split up into 256 blocks for this purpose) > > See DocTermOrds.java if you want more details than that. > > -Yonik > -- A.Nazemian
Re: java.lang.IllegalStateException: Too many values for UnInvertedField faceting on field content
Dear Alessandro, Thank you very much. Yeah sure it is far better, I did not think of that ;) Best regards. On Wed, Jul 22, 2015 at 2:31 PM, Alessandro Benedetti < benedetti.ale...@gmail.com> wrote: > In addition to Erick answer : > I agree 100% on your observations, but I would add that actually, DocValues > should be provided for all not tokenized fields instead of for all not > analysed fields. > > In the end there will be not practical difference if you build the > docValues structures for fields that have a keywordTokenizer ( and for > example a lowercaseTokenFilter following) . > Some charFilters before and simple token filter after can actually be > useful when sorting or faceting ( let's simplify those 2 as main uses for > DocValues) . > > Of course relaxing the use of DocValues from primitive types to analysed > types can be problematic, but there are scenarios where can be a good fit. > I should study a little bit more in deep, what are the current constraints > that are blocking docValues to be applied to analysed fields. > > Cheers > > > Cheers > > 2015-07-21 5:38 GMT+01:00 Erick Erickson : > > > This really seems like an XY problem. _Why_ are you faceting on a > > tokenized field? > > What are you really trying to accomplish? Because faceting on a > generalized > > content field that's an analyzed field is often A Bad Thing. Try going > > into the > > admin UI>> Schema Browser for that field, and you'll see how many unique > > terms > > you have in that field. Faceting on that many unique terms is rarely > > useful to the > > end user, so my suspicion is that you're not doing what you think you > > are. Or you > > have an unusual use-case. Either way, we need to understand what use-case > > you're trying to support in order to respond helpfully. > > > > You say that using facet.enum works, this is very surprising. That method > > uses > > the filterCache to create a bitset for each unique term. Which is totally > > incompatible with the uninverted field error you're reporting, so I > > clearly don't > > understand something about your setup. Are you _sure_? > > > > Best, > > Erick > > > > On Mon, Jul 20, 2015 at 9:32 PM, Ali Nazemian > > wrote: > > > Dear Toke and Davidphilip, > > > Hi, > > > The fieldtype text_fa has some custom language specific normalizer and > > > charfilter, here is the schema.xml value related for this field: > > > > positionIncrementGap="100"> > > > > > > > > class="com.ictcert.lucene.analysis.fa.FarsiCharFilterFactory"/> > > > > > > > > > > > > class="com.ictcert.lucene.analysis.fa.FarsiNormalizationFilterFactory"/> > > > > > words="lang/stopwords_fa.txt" /> > > > > > > > > > > > class="com.ictcert.lucene.analysis.fa.FarsiCharFilterFactory"/> > > > > > > > > > > > > class="com.ictcert.lucene.analysis.fa.FarsiNormalizationFilterFactory"/> > > > > > words="lang/stopwords_fa.txt" /> > > > > > > > > > > > > I did try the facet.method=enum and it works fine. Did you mean that > > > actually applying facet on analyzed field is wrong? > > > > > > Best regards. > > > > > > On Mon, Jul 20, 2015 at 8:07 PM, Toke Eskildsen < > t...@statsbiblioteket.dk> > > > wrote: > > > > > >> Ali Nazemian wrote: > > >> > I have a collection of 1.6m documents in Solr 5.2.1. > > >> > [...] > > >> > Caused by: java.lang.IllegalStateException: Too many values for > > >> > UnInvertedField faceting on field content > > >> > [...] 
> > >> > > >> > default="noval" termVectors="true" termPositions="true" > > >> > termOffsets="true"/> > > >> > > >> You are hitting an internal limit in Solr. As davidphilip tells you, > the > > >> solution is docValues, but they cannot be enabled for text fields. You > > need > > >> String fields, but the name of your field suggests that you need > > >> analyzation & tokenization, which cannot be done on String fields. > > >> > > >> > Would you please help me to solve this problem? > > >> > > >> With the information we have, it does not seem to be easy to solve: It > > >> seems like you want to facet on all terms in your index. As they need > > to be > > >> String (to use docValues), you would have to do all the splitting on > > white > > >> space, normalization etc. outside of Solr. > > >> > > >> - Toke Eskildsen > > >> > > > > > > > > > > > > -- > > > A.Nazemian > > > > > > -- > -- > > Benedetti Alessandro > Visiting card - http://about.me/alessandro_benedetti > Blog - http://alexbenedetti.blogspot.co.uk > > "Tyger, tyger burning bright > In the forests of the night, > What immortal hand or eye > Could frame thy fearful symmetry?" > > William Blake - Songs of Experience -1794 England > -- A.Nazemian
Solr MLT Interestingterms return different terms than Lucene MoreLikeThis for some of the documents
Hi, I am going to implement a SearchComponent for Solr to return a document's main keywords using the MoreLikeThis interesting terms. The main part of the implemented component, which calls mlt.retrieveInterestingTerms with a Lucene docID, does not work for all of the documents. I mean, for some documents the Solr MLT interestingterms handler returns useful terms as the top tf-idf terms while the implemented method returns null! For other documents both results (the Solr MLT interesting terms and mlt.retrieveInterestingTerms(docId)) are the same! Would you please help me solve this issue?

public List<String> getKeywords(int docId) throws SyntaxError {
    String[] fields = new String[keywordSourceFields.size()];
    List<String> terms = new ArrayList<>();
    fields = keywordSourceFields.toArray(fields);
    mlt.setFieldNames(fields);
    mlt.setAnalyzer(indexSearcher.getSchema().getIndexAnalyzer());
    mlt.setMinTermFreq(minTermFreq);
    mlt.setMinDocFreq(minDocFreq);
    mlt.setMinWordLen(minWordLen);
    mlt.setMaxQueryTerms(maxNumKeywords);
    mlt.setMaxNumTokensParsed(maxTokensParsed);
    try {
        terms = Arrays.asList(mlt.retrieveInterestingTerms(docId));
    } catch (IOException e) {
        LOGGER.error(e.getMessage());
        throw new RuntimeException(e);
    }
    return terms;
}

*Note:* I did define termVectors=true for all the required fields that I am going to use for generating interesting terms (the fields array in the method above). Best regards. -- A.Nazemian
Solr cross core join special condition
I was wondering how I can meet this query requirement in Solr 5.2.1: I have two different Solr cores, referred to as "core1" and "core2". core1 has some fields such as field1, field2 and field3, and core2 has some other fields such as field1, field4 and field5. I am looking for a Solr query which can return documents containing field1, field2, field3, field4 and field5, with some condition applied on core1.

For example:
core1:
-field1:123
-field2:"foo"
-field3:"bar"

core2:
-field1:123
-field4:"hello"
-field5:"world"

returning result:
field1:123
field2:"foo"
field3:"bar"
field4:"hello"
field5:"world"

Thank you very much. Best regards. -- A.Nazemian
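(For context, and not part of the original message: Solr's standard cross-core join query parser can filter core1 documents by a condition evaluated on core2, but it only returns fields of the core being queried, which is why the rest of the thread turns to enriching the result with the sibling core's fields. A minimal sketch of that filtering-only join, reusing the field names from the example above:)

import org.apache.solr.client.solrj.SolrQuery;

public class CrossCoreJoinSketch {
    public static SolrQuery build() {
        // Issued against core1: returns core1 documents whose field1 matches a
        // core2 document with field4:hello; field2/field3 come back, but
        // field4/field5 from core2 do not, hence the enrichment question.
        SolrQuery query = new SolrQuery("{!join from=field1 to=field1 fromIndex=core2}field4:hello");
        query.setFields("field1", "field2", "field3");
        return query;
    }
}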
Re: Solr cross core join special condition
Dear Mikhail, Hi, I want to enrich the result. Regards On Oct 6, 2015 7:07 PM, "Mikhail Khludnev" wrote: > Hello, > > Why do you need sibling core fields? do you facet? or just want to enrich > result page with them? > > On Tue, Oct 6, 2015 at 6:04 PM, Ali Nazemian > wrote: > > > I was wondering how can I overcome this query requirement in Solr 5.2.1: > > > > I have two different Solr cores refer as "core1" and "core2". core1 has > > some fields such as field1, field2 and field3 and core2 has some other > > fields such as field1, field4 and field5. I am looking for Solr query > which > > can return all of the documents requiring field1, field2, field3, field4 > > and field5 with considering some condition on core1. > > > > For example: > > core1: > > -field1:123 > > -field2:"foo" > > -field3:"bar" > > > > core2: > > -field1:123 > > -field4:"hello" > > -field5:"world" > > > > returning result: > > field1:123 > > field2:"foo" > > field3:"bar" > > field4:"hello" > > field4:"world" > > > > Thank you very much. > > > > Best regards. > > > > -- > > A.Nazemian > > > > > > -- > Sincerely yours > Mikhail Khludnev > Principal Engineer, > Grid Dynamics > > <http://www.griddynamics.com> > >
Re: Solr cross core join special condition
Yeah, but the child document transformer is used for nested documents inside a single core, while I am looking to join results across multiple cores. So it seems there is no way to do that right now and it would have to be developed somehow. Am I right? Regards. On Oct 6, 2015 9:53 PM, "Mikhail Khludnev" wrote: > thus, something like [child] > > https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents > can be developed. > > On Tue, Oct 6, 2015 at 6:45 PM, Ali Nazemian > wrote: > > > Dear Mikhail, > > Hi, > > I want to enrich the result. > > Regards > > On Oct 6, 2015 7:07 PM, "Mikhail Khludnev" > > wrote: > > > > > Hello, > > > > > > Why do you need sibling core fields? do you facet? or just want to > enrich > > > result page with them? > > > > > > On Tue, Oct 6, 2015 at 6:04 PM, Ali Nazemian > > > wrote: > > > > > > > I was wondering how can I overcome this query requirement in Solr > > 5.2.1: > > > > > > > > I have two different Solr cores refer as "core1" and "core2". core1 > > has > > > > some fields such as field1, field2 and field3 and core2 has some > other > > > > fields such as field1, field4 and field5. I am looking for Solr query > > > which > > > > can return all of the documents requiring field1, field2, field3, > > field4 > > > > and field5 with considering some condition on core1. > > > > > > > > For example: > > > > core1: > > > > -field1:123 > > > > -field2:"foo" > > > > -field3:"bar" > > > > > > > > core2: > > > > -field1:123 > > > > -field4:"hello" > > > > -field5:"world" > > > > > > > > returning result: > > > > field1:123 > > > > field2:"foo" > > > > field3:"bar" > > > > field4:"hello" > > > > field4:"world" > > > > > > > > Thank you very much. > > > > > > > > Best regards. > > > > > > > > -- > > > > A.Nazemian > > > > > > > > > > > > > > > > -- > > > Sincerely yours > > > Mikhail Khludnev > > > Principal Engineer, > > > Grid Dynamics > > > > > > <http://www.griddynamics.com> > > > > > > > > > > > > -- > Sincerely yours > Mikhail Khludnev > Principal Engineer, > Grid Dynamics > > <http://www.griddynamics.com> > >
Re: Solr cross core join special condition
Dear Susheel, Hi, I did check the jira issue that you mentioned but it seems its target is Solr 6! Am I correct? The patch failed for Solr 5.3 due to class not found. For Solr 5.x should I try to implement something similar myself? Sincerely yours. On Wed, Oct 7, 2015 at 7:15 PM, Susheel Kumar wrote: > You may want to take a look at new Solr feature of Streaming API & > Expressions > https://issues.apache.org/jira/browse/SOLR-7584?filter=12333278 > for making joins between collections. > > On Wed, Oct 7, 2015 at 9:42 AM, Ryan Josal wrote: > > > I developed a join transformer plugin that did that (although it didn't > > flatten the results like that). The one thing that was painful about it > is > > that the TextResponseWriter has references to both the IndexSchema and > > SolrReturnFields objects for the primary core. So when you add a > > SolrDocument from another core it returned the wrong fields. I worked > > around that by transforming the SolrDocument to a NamedList. Then when > it > > gets to processing the IndexableFields it uses the wrong IndexSchema, I > > worked around that by transforming each field to a hard Java object > > (through the IndexSchema and FieldType of the correct core). I think it > > would be great to patch TextResponseWriter with multi core writing > > abilities, but there is one question, how can it tell which core a > > SolrDocument or IndexableField is from? Seems we'd have to add an > > attribute for that. > > > > The other possibly simpler thing to do is execute the join at index time > > with an update processor. > > > > Ryan > > > > On Tuesday, October 6, 2015, Mikhail Khludnev < > mkhlud...@griddynamics.com> > > wrote: > > > > > On Wed, Oct 7, 2015 at 7:05 AM, Ali Nazemian > > > wrote: > > > > > > > it > > > > seems there is not any way to do that right now and it should be > > > developed > > > > somehow. Am I right? > > > > > > > > > > yep > > > > > > > > > -- > > > Sincerely yours > > > Mikhail Khludnev > > > Principal Engineer, > > > Grid Dynamics > > > > > > <http://www.griddynamics.com> > > > > > > > > > > -- A.Nazemian
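(As a rough illustration of the Streaming Expressions feature Susheel points to above (SOLR-7584), a join across the two example cores might be expressed roughly as follows. This is a hedged sketch using Solr 6 streaming-expression syntax against the /stream handler; it is not taken from the thread, and it assumes both sides can be sorted and exported on the join key field1:)

innerJoin(
  search(core1, q="*:*", fl="field1,field2,field3", sort="field1 asc"),
  search(core2, q="*:*", fl="field1,field4,field5", sort="field1 asc"),
  on="field1"
)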
Re: Solr cross core join special condition
Thank you very much. Sincerely yours. On Mon, Oct 12, 2015 at 6:15 AM, Susheel Kumar wrote: > Yes, Ali. These are targeted for Solr 6 but you have the option download > source from trunk, build it and try out these features if that helps in the > meantime. > > Thanks > Susheel > > On Sun, Oct 11, 2015 at 10:01 AM, Ali Nazemian > wrote: > > > Dear Susheel, > > Hi, > > > > I did check the jira issue that you mentioned but it seems its target is > > Solr 6! Am I correct? The patch failed for Solr 5.3 due to class not > found. > > For Solr 5.x should I try to implement something similar myself? > > > > Sincerely yours. > > > > > > On Wed, Oct 7, 2015 at 7:15 PM, Susheel Kumar > > wrote: > > > > > You may want to take a look at new Solr feature of Streaming API & > > > Expressions > > > https://issues.apache.org/jira/browse/SOLR-7584?filter=12333278 > > > for making joins between collections. > > > > > > On Wed, Oct 7, 2015 at 9:42 AM, Ryan Josal wrote: > > > > > > > I developed a join transformer plugin that did that (although it > didn't > > > > flatten the results like that). The one thing that was painful about > > it > > > is > > > > that the TextResponseWriter has references to both the IndexSchema > and > > > > SolrReturnFields objects for the primary core. So when you add a > > > > SolrDocument from another core it returned the wrong fields. I > worked > > > > around that by transforming the SolrDocument to a NamedList. Then > when > > > it > > > > gets to processing the IndexableFields it uses the wrong > IndexSchema, I > > > > worked around that by transforming each field to a hard Java object > > > > (through the IndexSchema and FieldType of the correct core). I think > > it > > > > would be great to patch TextResponseWriter with multi core writing > > > > abilities, but there is one question, how can it tell which core a > > > > SolrDocument or IndexableField is from? Seems we'd have to add an > > > > attribute for that. > > > > > > > > The other possibly simpler thing to do is execute the join at index > > time > > > > with an update processor. > > > > > > > > Ryan > > > > > > > > On Tuesday, October 6, 2015, Mikhail Khludnev < > > > mkhlud...@griddynamics.com> > > > > wrote: > > > > > > > > > On Wed, Oct 7, 2015 at 7:05 AM, Ali Nazemian < > alinazem...@gmail.com > > > > > > wrote: > > > > > > > > > > > it > > > > > > seems there is not any way to do that right now and it should be > > > > > developed > > > > > > somehow. Am I right? > > > > > > > > > > > > > > > > yep > > > > > > > > > > > > > > > -- > > > > > Sincerely yours > > > > > Mikhail Khludnev > > > > > Principal Engineer, > > > > > Grid Dynamics > > > > > > > > > > <http://www.griddynamics.com> > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > A.Nazemian > > > -- A.Nazemian
Re: Solr cross core join special condition
Dear Shawn, Hi, Since Yonik's Solr blog <http://yonik.com/solr-5-4/> mentions this feature as one of the Solr 5.4 features, I assume it will be back-ported to the next stable release (5.4). Please correct me if that is the wrong assumption. Thank you very much. Sincerely yours. On Mon, Oct 12, 2015 at 12:29 PM, Ali Nazemian wrote: > Thank you very much. > > Sincerely yours. > > On Mon, Oct 12, 2015 at 6:15 AM, Susheel Kumar > wrote: > >> Yes, Ali. These are targeted for Solr 6 but you have the option download >> source from trunk, build it and try out these features if that helps in >> the >> meantime. >> >> Thanks >> Susheel >> >> On Sun, Oct 11, 2015 at 10:01 AM, Ali Nazemian >> wrote: >> >> > Dear Susheel, >> > Hi, >> > >> > I did check the jira issue that you mentioned but it seems its target is >> > Solr 6! Am I correct? The patch failed for Solr 5.3 due to class not >> found. >> > For Solr 5.x should I try to implement something similar myself? >> > >> > Sincerely yours. >> > >> > >> > On Wed, Oct 7, 2015 at 7:15 PM, Susheel Kumar >> > wrote: >> > >> > > You may want to take a look at new Solr feature of Streaming API & >> > > Expressions >> > > https://issues.apache.org/jira/browse/SOLR-7584?filter=12333278 >> > > for making joins between collections. >> > > >> > > On Wed, Oct 7, 2015 at 9:42 AM, Ryan Josal wrote: >> > > >> > > > I developed a join transformer plugin that did that (although it >> didn't >> > > > flatten the results like that). The one thing that was painful >> about >> > it >> > > is >> > > > that the TextResponseWriter has references to both the IndexSchema >> and >> > > > SolrReturnFields objects for the primary core. So when you add a >> > > > SolrDocument from another core it returned the wrong fields. I >> worked >> > > > around that by transforming the SolrDocument to a NamedList. Then >> when >> > > it >> > > > gets to processing the IndexableFields it uses the wrong >> IndexSchema, I >> > > > worked around that by transforming each field to a hard Java object >> > > > (through the IndexSchema and FieldType of the correct core). I >> think >> > it >> > > > would be great to patch TextResponseWriter with multi core writing >> > > > abilities, but there is one question, how can it tell which core a >> > > > SolrDocument or IndexableField is from? Seems we'd have to add an >> > > > attribute for that. >> > > > >> > > > The other possibly simpler thing to do is execute the join at index >> > time >> > > > with an update processor. >> > > > >> > > > Ryan >> > > > >> > > > On Tuesday, October 6, 2015, Mikhail Khludnev < >> > > mkhlud...@griddynamics.com> >> > > > wrote: >> > > > >> > > > > On Wed, Oct 7, 2015 at 7:05 AM, Ali Nazemian < >> alinazem...@gmail.com >> > > > > > wrote: >> > > > > >> > > > > > it >> > > > > > seems there is not any way to do that right now and it should be >> > > > > developed >> > > > > > somehow. Am I right? >> > > > > > >> > > > > >> > > > > yep >> > > > > >> > > > > >> > > > > -- >> > > > > Sincerely yours >> > > > > Mikhail Khludnev >> > > > > Principal Engineer, >> > > > > Grid Dynamics >> > > > > >> > > > > <http://www.griddynamics.com> >> > > > > > >> > > > > >> > > > >> > > >> > >> > >> > >> > -- >> > A.Nazemian >> > >> > > > > -- > A.Nazemian > -- A.Nazemian
Re: Soft commit and hard commit
Dear Midas, Hi, AFAIK, Solr memory-maps its index files, so it relies heavily on free OS memory for the page cache. Therefore using 36GB out of 48GB of RAM for the Java heap is not recommended; as a rule of thumb, do not allocate more than 25% of your total memory to the Solr JVM in usual situations. About your main question, setting soft commit and hard commit intervals for Solr is highly dependent on your application. A really nice guide for this purpose is provided by Lucidworks; in order to find the best values for soft commit and hard commit please follow this guide: http://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ Best regards. On Mon, Nov 30, 2015 at 9:48 AM, Midas A wrote: > Machine configuration > > RAM: 48 GB > CPU: 8 core > JVM : 36 GB > > We are updating 70 , 000 docs / hr . what should be our soft commit and > hard commit time to get best results. > > Current configuration : > 6 false autoCommit> > > > 60 > > There are no read on master server. > -- A.Nazemian
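(To make the commit knobs concrete, a typical solrconfig.xml shape for the settings discussed in the Lucidworks guide is sketched below; the values are illustrative assumptions to be tuned for the workload, not recommendations from this reply:)

<autoCommit>
  <!-- flush to disk regularly, but do not open a new searcher on every hard commit -->
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <!-- controls how quickly newly indexed documents become visible to searches -->
  <maxTime>30000</maxTime>
</autoSoftCommit>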
Solr 5.2.1 deadlock on commit
Hi, It has been a while since I started having this problem with Solr 5.2.1 and I could not fix it yet. The only thing that is clear to me is that when I send a bulk update to Solr, the commit thread gets blocked! Here is the thread dump output:

"qtp595445781-8207" prio=10 tid=0x7f0bf68f5800 nid=0x5785 waiting for monitor entry [0x7f081cf04000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:608)
- waiting to lock <0x00067ba2e660> (a java.lang.Object)
at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:95)
at org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalCommit(DistributedUpdateProcessor.java:1635)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1612)
at org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:161)
at org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
at org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:270)
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:177)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:98)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.server.Server.handle(Server.java:497)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- None

FYI, there are lots of blocked threads in the thread dump report and Solr becomes really slow in this case. The temporary solution is restarting Solr, but I am really sick of restarting! I would really appreciate it if somebody could help me solve this problem. Best regards. -- A.Nazemian
Re: Solr 5.2.1 deadlock on commit
Dear Emir, Hi, There are some cases that I have soft commit in my application. However, the bulk update part has only hard commit for a bulk of 2500 documents. Here are some information about the whole indexing/updating scenarios: - Indexing part uses soft commit. - In a single update cases soft commit is used. - For bulk update batch hard commit is used (on 2500 documents) - Auto hard commit :120 sec - Auto soft commit: disable Best regards. On Tue, Dec 8, 2015 at 12:35 PM, Emir Arnautovic < emir.arnauto...@sematext.com> wrote: > Hi Ali, > This thread is blocked because cannot obtain update lock - in this > particular case when doing soft commit. I am guessing that there others are > blocked for the same reason. Can you tell us bit more about your setup and > indexing load and procedure? Do you do explicit commits? > > Regards, > Emir > > -- > Monitoring * Alerting * Anomaly Detection * Centralized Log Management > Solr & Elasticsearch Support * http://sematext.com/ > > > > On 08.12.2015 08:16, Ali Nazemian wrote: > >> Hi, >> There is a while since I have had problem with Solr 5.2.1 and I could not >> fix it yet. The only think that is clear to me is when I send bulk update >> to Solr the commit thread will be blocked! Here is the thread dump output: >> >> "qtp595445781-8207" prio=10 tid=0x7f0bf68f5800 nid=0x5785 waiting for >> monitor entry [0x7f081cf04000] >> java.lang.Thread.State: BLOCKED (on object monitor) >> at >> >> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:608) >> - waiting to lock <0x00067ba2e660> (a java.lang.Object) >> at >> >> org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:95) >> at >> >> org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64) >> at >> >> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalCommit(DistributedUpdateProcessor.java:1635) >> at >> >> org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1612) >> at >> >> org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:161) >> at >> >> org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64) >> at >> >> org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64) >> at >> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:270) >> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:177) >> at >> >> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:98) >> at >> >> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) >> at >> >> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143) >> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064) >> at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654) >> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450) >> at >> >> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227) >> at >> >> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196) >> at >> >> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652) >> at >> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585) >> at >> >> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) >> at >> >> 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577) >> at >> >> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223) >> at >> >> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127) >> at >> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) >> at >> >> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) >> at >> >> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061) >> at >> >> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) >> at >> >> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215) >> at >> >> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollect
Re: Solr 5.2.1 deadlock on commit
The indexing load is as follows: - Around 1000 documents every 5 mins. - The indexing speed is slow because of the complicated analyzer which is applied to each document. It takes around 60 seconds to index 1000 documents with applying this analyzer (It is really slow. However, based on the analyzing part I think it would be acceptable). - The concurrentsolrclient is used in all the indexing/updating cases. Regards. On Tue, Dec 8, 2015 at 6:36 PM, Ali Nazemian wrote: > Dear Emir, > Hi, > There are some cases that I have soft commit in my application. However, > the bulk update part has only hard commit for a bulk of 2500 documents. > Here are some information about the whole indexing/updating scenarios: > - Indexing part uses soft commit. > - In a single update cases soft commit is used. > - For bulk update batch hard commit is used (on 2500 documents) > - Auto hard commit :120 sec > - Auto soft commit: disable > > Best regards. > > > On Tue, Dec 8, 2015 at 12:35 PM, Emir Arnautovic < > emir.arnauto...@sematext.com> wrote: > >> Hi Ali, >> This thread is blocked because cannot obtain update lock - in this >> particular case when doing soft commit. I am guessing that there others are >> blocked for the same reason. Can you tell us bit more about your setup and >> indexing load and procedure? Do you do explicit commits? >> >> Regards, >> Emir >> >> -- >> Monitoring * Alerting * Anomaly Detection * Centralized Log Management >> Solr & Elasticsearch Support * http://sematext.com/ >> >> >> >> On 08.12.2015 08:16, Ali Nazemian wrote: >> >>> Hi, >>> There is a while since I have had problem with Solr 5.2.1 and I could not >>> fix it yet. The only think that is clear to me is when I send bulk update >>> to Solr the commit thread will be blocked! Here is the thread dump >>> output: >>> >>> "qtp595445781-8207" prio=10 tid=0x7f0bf68f5800 nid=0x5785 waiting for >>> monitor entry [0x7f081cf04000] >>> java.lang.Thread.State: BLOCKED (on object monitor) >>> at >>> >>> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:608) >>> - waiting to lock <0x00067ba2e660> (a java.lang.Object) >>> at >>> >>> org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:95) >>> at >>> >>> org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64) >>> at >>> >>> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalCommit(DistributedUpdateProcessor.java:1635) >>> at >>> >>> org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1612) >>> at >>> >>> org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:161) >>> at >>> >>> org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64) >>> at >>> >>> org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64) >>> at >>> >>> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:270) >>> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:177) >>> at >>> >>> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:98) >>> at >>> >>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) >>> at >>> >>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143) >>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064) >>> at 
org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654) >>> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450) >>> at >>> >>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227) >>> at >>> >>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196) >>> at >>> >>> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652) >>> at >>> >>> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585) >>> at >>> >>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) >>> at >>> >>> org.eclipse.jetty.security.SecurityHan
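(An aside that is not from the thread: one middle ground between the explicit per-batch hard commit described above and pure autoCommit is SolrJ's commitWithin, which lets Solr schedule the commit itself within a time window. A hedged sketch; the 120-second window simply mirrors the auto hard commit interval mentioned earlier and is not a recommendation from the list:)

import java.util.Collection;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CommitWithinSketch {
    // Instead of calling commit() after every 2500-document bulk update,
    // ask Solr to commit on its own within the given number of milliseconds.
    public static void sendBulk(SolrClient client, Collection<SolrInputDocument> batch)
            throws Exception {
        client.add(batch, 120000);   // commitWithin of 120s
    }
}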
Re: Solr 5.2.1 deadlock on commit
I did that already. The situation was worse. The autocommit part makes solr unavailable. On Dec 8, 2015 7:13 PM, "Emir Arnautovic" wrote: > Hi Ali, > Can you try without explicit commits and see if threads will still be > blocked. > > Thanks, > Emir > > On 08.12.2015 16:19, Ali Nazemian wrote: > >> The indexing load is as follows: >> - Around 1000 documents every 5 mins. >> - The indexing speed is slow because of the complicated analyzer which is >> applied to each document. It takes around 60 seconds to index 1000 >> documents with applying this analyzer (It is really slow. However, based >> on >> the analyzing part I think it would be acceptable). >> - The concurrentsolrclient is used in all the indexing/updating cases. >> >> Regards. >> >> On Tue, Dec 8, 2015 at 6:36 PM, Ali Nazemian >> wrote: >> >> Dear Emir, >>> Hi, >>> There are some cases that I have soft commit in my application. However, >>> the bulk update part has only hard commit for a bulk of 2500 documents. >>> Here are some information about the whole indexing/updating scenarios: >>> - Indexing part uses soft commit. >>> - In a single update cases soft commit is used. >>> - For bulk update batch hard commit is used (on 2500 documents) >>> - Auto hard commit :120 sec >>> - Auto soft commit: disable >>> >>> Best regards. >>> >>> >>> On Tue, Dec 8, 2015 at 12:35 PM, Emir Arnautovic < >>> emir.arnauto...@sematext.com> wrote: >>> >>> Hi Ali, >>>> This thread is blocked because cannot obtain update lock - in this >>>> particular case when doing soft commit. I am guessing that there others >>>> are >>>> blocked for the same reason. Can you tell us bit more about your setup >>>> and >>>> indexing load and procedure? Do you do explicit commits? >>>> >>>> Regards, >>>> Emir >>>> >>>> -- >>>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management >>>> Solr & Elasticsearch Support * http://sematext.com/ >>>> >>>> >>>> >>>> On 08.12.2015 08:16, Ali Nazemian wrote: >>>> >>>> Hi, >>>>> There is a while since I have had problem with Solr 5.2.1 and I could >>>>> not >>>>> fix it yet. The only think that is clear to me is when I send bulk >>>>> update >>>>> to Solr the commit thread will be blocked! 
Here is the thread dump >>>>> output: >>>>> >>>>> "qtp595445781-8207" prio=10 tid=0x7f0bf68f5800 nid=0x5785 waiting >>>>> for >>>>> monitor entry [0x7f081cf04000] >>>>> java.lang.Thread.State: BLOCKED (on object monitor) >>>>> at >>>>> >>>>> >>>>> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:608) >>>>> - waiting to lock <0x00067ba2e660> (a java.lang.Object) >>>>> at >>>>> >>>>> >>>>> org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:95) >>>>> at >>>>> >>>>> >>>>> org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64) >>>>> at >>>>> >>>>> >>>>> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalCommit(DistributedUpdateProcessor.java:1635) >>>>> at >>>>> >>>>> >>>>> org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1612) >>>>> at >>>>> >>>>> >>>>> org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:161) >>>>> at >>>>> >>>>> >>>>> org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64) >>>>> at >>>>> >>>>> >>>>> org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64) >>>>> at >>>>> >>>>> >>>>> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:270) >>>>> at org.apache.sol
Re: Solr 5.2.1 deadlock on commit
I really appreciate if somebody can help me to solve this problem. Regards. On Tue, Dec 8, 2015 at 9:22 PM, Ali Nazemian wrote: > I did that already. The situation was worse. The autocommit part makes > solr unavailable. > On Dec 8, 2015 7:13 PM, "Emir Arnautovic" > wrote: > >> Hi Ali, >> Can you try without explicit commits and see if threads will still be >> blocked. >> >> Thanks, >> Emir >> >> On 08.12.2015 16:19, Ali Nazemian wrote: >> >>> The indexing load is as follows: >>> - Around 1000 documents every 5 mins. >>> - The indexing speed is slow because of the complicated analyzer which is >>> applied to each document. It takes around 60 seconds to index 1000 >>> documents with applying this analyzer (It is really slow. However, based >>> on >>> the analyzing part I think it would be acceptable). >>> - The concurrentsolrclient is used in all the indexing/updating cases. >>> >>> Regards. >>> >>> On Tue, Dec 8, 2015 at 6:36 PM, Ali Nazemian >>> wrote: >>> >>> Dear Emir, >>>> Hi, >>>> There are some cases that I have soft commit in my application. However, >>>> the bulk update part has only hard commit for a bulk of 2500 documents. >>>> Here are some information about the whole indexing/updating scenarios: >>>> - Indexing part uses soft commit. >>>> - In a single update cases soft commit is used. >>>> - For bulk update batch hard commit is used (on 2500 documents) >>>> - Auto hard commit :120 sec >>>> - Auto soft commit: disable >>>> >>>> Best regards. >>>> >>>> >>>> On Tue, Dec 8, 2015 at 12:35 PM, Emir Arnautovic < >>>> emir.arnauto...@sematext.com> wrote: >>>> >>>> Hi Ali, >>>>> This thread is blocked because cannot obtain update lock - in this >>>>> particular case when doing soft commit. I am guessing that there >>>>> others are >>>>> blocked for the same reason. Can you tell us bit more about your setup >>>>> and >>>>> indexing load and procedure? Do you do explicit commits? >>>>> >>>>> Regards, >>>>> Emir >>>>> >>>>> -- >>>>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management >>>>> Solr & Elasticsearch Support * http://sematext.com/ >>>>> >>>>> >>>>> >>>>> On 08.12.2015 08:16, Ali Nazemian wrote: >>>>> >>>>> Hi, >>>>>> There is a while since I have had problem with Solr 5.2.1 and I could >>>>>> not >>>>>> fix it yet. The only think that is clear to me is when I send bulk >>>>>> update >>>>>> to Solr the commit thread will be blocked! Here is the thread dump >>>>>> output: >>>>>> >>>>>> "qtp595445781-8207" prio=10 tid=0x7f0bf68f5800 nid=0x5785 waiting >>>>>> for >>>>>> monitor entry [0x7f081cf04000] >>>>>> java.lang.Thread.State: BLOCKED (on object monitor) >>>>>> at >>>>>> >>>>>> >>>>>> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:608) >>>>>> - waiting to lock <0x00067ba2e660> (a java.lang.Object) >>>>>> at >>>>>> >>>>>> >>>>>> org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:95) >>>>>> at >>>>>> >>>>>> >>>>>> org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64) >>>>>> at >>>>>> >>>>>> >>>>>> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalCommit(DistributedUpdateProcessor.java:1635) >>>>>> at >>>>>> >>>>>> >>>>>> org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1612) >>>>>> at >>>>>> >>>>>> >>>>>> org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:161) >>>>>> at >>>>>> >>>>>> >>>>>>
Re: Solr 5.2.1 deadlock on commit
Dear Emir, Hi, Actually Solr is in a deadlock state it will not accept any new document. (some of them will store in tlog and some of them not) However, It will response to the new query requests very slowly. Unfortunately right now I have not any access to full thread dump. But, as I mentioned, it is full of thread in blocked state. P.S: I am using 20 threads for the indexing part. I am suspicious of auto hard commit part. Since the indexing/updating part is really slow for the sake of complicate analyzer, it is possible that updating 2500 documents takes more than 120 seconds so before finishing the first hard commit second hard commit would arrive same for third and forth and so on. Therefore it might possible that lots of commit thread would be active at the same time with lots of documents in memory that are not flushed to disk yet. However, I am not sure that such scenario could take Solr threads to deadlock state! Best regards. On Fri, Dec 11, 2015 at 1:02 PM, Emir Arnautovic < emir.arnauto...@sematext.com> wrote: > Hi Ali, > Is Solr busy at that time and eventually recover or it is deadlocked? Can > you provide full thread dump when it happened? > Do you run only indexing at that time? Is "unavailable" only from indexing > perspective, or you cannot do anything with Solr? > Is there any indexing scenario that does not cause this (extreme/useless > one is without commits)? > Did you try throttling indexing or changing bulk size? > How many indexing threads? > > Thanks, > Emir > > > On 11.12.2015 10:06, Ali Nazemian wrote: > >> I really appreciate if somebody can help me to solve this problem. >> Regards. >> >> On Tue, Dec 8, 2015 at 9:22 PM, Ali Nazemian >> wrote: >> >> I did that already. The situation was worse. The autocommit part makes >>> solr unavailable. >>> On Dec 8, 2015 7:13 PM, "Emir Arnautovic" >>> wrote: >>> >>> Hi Ali, >>>> Can you try without explicit commits and see if threads will still be >>>> blocked. >>>> >>>> Thanks, >>>> Emir >>>> >>>> On 08.12.2015 16:19, Ali Nazemian wrote: >>>> >>>> The indexing load is as follows: >>>>> - Around 1000 documents every 5 mins. >>>>> - The indexing speed is slow because of the complicated analyzer which >>>>> is >>>>> applied to each document. It takes around 60 seconds to index 1000 >>>>> documents with applying this analyzer (It is really slow. However, >>>>> based >>>>> on >>>>> the analyzing part I think it would be acceptable). >>>>> - The concurrentsolrclient is used in all the indexing/updating cases. >>>>> >>>>> Regards. >>>>> >>>>> On Tue, Dec 8, 2015 at 6:36 PM, Ali Nazemian >>>>> wrote: >>>>> >>>>> Dear Emir, >>>>> >>>>>> Hi, >>>>>> There are some cases that I have soft commit in my application. >>>>>> However, >>>>>> the bulk update part has only hard commit for a bulk of 2500 >>>>>> documents. >>>>>> Here are some information about the whole indexing/updating scenarios: >>>>>> - Indexing part uses soft commit. >>>>>> - In a single update cases soft commit is used. >>>>>> - For bulk update batch hard commit is used (on 2500 documents) >>>>>> - Auto hard commit :120 sec >>>>>> - Auto soft commit: disable >>>>>> >>>>>> Best regards. >>>>>> >>>>>> >>>>>> On Tue, Dec 8, 2015 at 12:35 PM, Emir Arnautovic < >>>>>> emir.arnauto...@sematext.com> wrote: >>>>>> >>>>>> Hi Ali, >>>>>> >>>>>>> This thread is blocked because cannot obtain update lock - in this >>>>>>> particular case when doing soft commit. I am guessing that there >>>>>>> others are >>>>>>> blocked for the same reason. 
Can you tell us bit more about your >>>>>>> setup >>>>>>> and >>>>>>> indexing load and procedure? Do you do explicit commits? >>>>>>> >>>>>>> Regards, >>>>>>> Emir >>>>>>> >>>>>>> -- >>>>>>> Monitoring * Alerting * Anomaly Detection * Centralized Log >>>>>>> Mana
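As an aside to this thread, Emir's "no explicit commits" experiment is easy to set up on the client side: send the batches without calling commit() at all and let the server-side autoCommit (plus an autoSoftCommit interval for visibility) do the work. A rough SolrJ 5.x sketch, where the URL, queue size and thread count are arbitrary examples:

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class NoExplicitCommitIndexer {
  public static void main(String[] args) throws Exception {
    // Buffers documents and streams them to Solr with 4 background threads.
    ConcurrentUpdateSolrClient client =
        new ConcurrentUpdateSolrClient("http://localhost:8983/solr/mycore", 100, 4);

    for (int i = 0; i < 1000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-" + i);
      doc.addField("text", "document body " + i);
      client.add(doc);              // no commit() here...
    }

    client.blockUntilFinished();    // ...durability and visibility are left to
    client.close();                 // autoCommit/autoSoftCommit in solrconfig.xml
  }
}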
Solr query performance
Dear all, Hi, I was wondering whether there is any performance comparison available for different Solr queries. I mean, what is the cost of different Solr queries from the memory and CPU points of view? I am looking for a report that could help me choose between different alternatives for sending a single query to Solr. Thank you very much. Best regards. -- A.Nazemian
filtering tfq() function query to specific part of collection not the whole documents
Hi, I was wondering whether it is possible to restrict the tfq() function query to a specific subset of the collection. Suppose I want to count all occurrences of the term "test" in documents matching fq=category:2; how can I handle such a query with the tfq() function query? It seems that applying fq=category:2 in a "select" query does not affect tfq(): no matter what the rest of my query is, tfq() always returns the total term frequency of the field over the whole collection. So what is the solution for this case? Best regards. -- A.Nazemian
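For what it's worth, one way to get a per-subset total, assuming a Solr version with the JSON Facet API, is to sum the per-document termfreq() values over the filtered result set instead of relying on the collection-wide statistic. Something along these lines, where the field and category values are only examples and sum() over termfreq() should be verified on your version:

q=*:*&fq=category:2&rows=0&json.facet={total_test:"sum(termfreq(content,'test'))"}

Because the aggregation runs only over documents matching q and fq, the category:2 restriction is honored, unlike a total-term-frequency function, which is an index-level statistic by design.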
Custom updateProcessor for purpose of extracting interesting terms at index time
Dear All, Hi, I wrote a custom UpdateRequestProcessorFactory for the purpose of extracting interesting terms at index time and putting them in a new field. Since I use MLT interesting terms for this purpose, I have to check whether the added document already exists in the index. If it was indexed before, there is no problem for MLT interesting terms. But if it is a new document, I have to index it before calling MLT interesting terms. Here is the small part of my code that ran me into this problem:

if (!isDocIndexed(cmd.getIndexedId())) {
  // Do not extract keywords since the document is not indexed yet
  super.processAdd(cmd);
  processAdd(cmd);
  return;
}

My problem is that the searcher returned by core.getRealtimeSearcher() does not change after calling super.processAdd(cmd), so this piece of code causes an infinite loop! Would you please guide me on how to make sure that my custom update processor runs at the end of the indexing process (in order to use MLT interesting terms without having to worry about whether the document already exists in the index)? Best regards. -- A.Nazemian
Re: Custom updateProcessor for purpose of extracting interesting terms at index time
Dear Alex, Hi, I am not sure about what would be the best way of doing such process, Would you please provide me some detail example about doing that on commit? like spell checker that you mentioned? Is is possible to do that using a custom analyzer on a copy field? In order to use MLT interesting terms I should have access to SolrIndexSearcher. I am not sure that can I have access to SolrIndexSearcher in analyzer or not? Best regards. On Mon, Mar 23, 2015 at 11:51 PM, Alexandre Rafalovitch wrote: > So, for a new document. You want to index the document, then read it, > then add keywords and index again? This does sound like an infinite > loop. Not sure there is a solution for this approach. > > You sure you cannot do it like spell checker does with compiling a > side-car index on commit? Or even with some sort of periodic trigger > and update command issues on previously indexed but not post-processed > documents. > > Regards, >Alex. > > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: > http://www.solr-start.com/ > > > On 23 March 2015 at 15:07, Ali Nazemian wrote: > > Dear All, > > Hi, > > I wrote a customize updateProcessorFactory for the purpose of extracting > > interesting terms at index time an putting them in a new field. Since I > use > > MLT interesting terms for this purpose, I have to make sure that the > added > > document exists in index or not. If it was indexed before there is no > > problem for MLT interesting terms. But if it is a new document I have to > > index this document before calling MLT interesting terms. > > Here is a small part of my code that ran me into problem: > > > > if (!isDocIndexed(cmd.getIndexedId())) { > > // Do not extract keyword since it is not indexed yet > > super.processAdd(cmd); > > processAdd(cmd); > > return; > > } > > > > My problem is the core.getRealtimeSearcher() method does not change after > > calling super.processAdd(cmd). Therefore such part of code causes > infinite > > loop! Would you please guide me how can I make sure that my custom > > updateProcessorFactory run at the end of indexing process. (in order to > > using MLT interesting terms without having concern about the existence of > > document in index. > > > > Best regards. > > -- > > A.Nazemian > -- A.Nazemian
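For reference, the "do it on commit" idea that Alex mentions can be wired up with an event listener registered in solrconfig.xml (a <listener> entry on the newSearcher or postCommit event). The following is only a skeleton under the assumption that Solr 4.x/5.x APIs are in use; the keyword extraction and the re-submission of documents are left as placeholders:

import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.SolrCore;
import org.apache.solr.core.SolrEventListener;
import org.apache.solr.search.SolrIndexSearcher;

// Skeleton of a listener that runs whenever a commit opens a new searcher.
public class KeywordExtractionListener implements SolrEventListener {

  private final SolrCore core;

  // Solr passes the core to listeners that declare a (SolrCore) constructor;
  // this mirrors the built-in listeners, so verify it for your version.
  public KeywordExtractionListener(SolrCore core) {
    this.core = core;
  }

  @Override
  public void init(NamedList args) {
    // read minTermFreq, maxNumKeywords, source field names, etc. from the listener config
  }

  @Override
  public void newSearcher(SolrIndexSearcher newSearcher, SolrIndexSearcher currentSearcher) {
    // 1. query newSearcher for documents that do not have the keyword field yet
    // 2. run MoreLikeThis.retrieveInterestingTerms(docId) against newSearcher
    // 3. re-submit those documents through the normal update chain (e.g. as atomic
    //    updates) so the keywords become searchable after the next commit
  }

  @Override
  public void postCommit() {
    // not used; the work happens in newSearcher() where a searcher is available
  }

  @Override
  public void postSoftCommit() {
    // not used
  }
}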
filtering indexed documents with multiple filters
Dear all, Hi, I am looking for a way to filter a Lucene index with multiple conditions. For this purpose I tried two different methods of filtering the search; neither of them works for me:

Using BooleanQuery:

BooleanQuery query = new BooleanQuery();
String lower = "*";
String upper = "*";
for (String fieldName : keywordSourceFields) {
  TermRangeQuery rangeQuery = TermRangeQuery.newStringRange(fieldName, lower, upper, true, true);
  query.add(rangeQuery, Occur.MUST);
}
TermRangeQuery rangeQuery = TermRangeQuery.newStringRange(keywordField, lower, upper, true, true);
query.add(rangeQuery, Occur.MUST_NOT);
try {
  TopDocs results = searcher.search(query, null, maxNumDocs);

Using BooleanFilter:

BooleanFilter filter = new BooleanFilter();
String lower = "*";
String upper = "*";
for (String fieldName : keywordSourceFields) {
  TermRangeFilter rangeFilter = TermRangeFilter.newStringRange(fieldName, lower, upper, true, true);
  filter.add(rangeFilter, Occur.MUST_NOT);
}
TermRangeFilter rangeFilter = TermRangeFilter.newStringRange(keywordField, lower, upper, true, true);
filter.add(rangeFilter, Occur.MUST);
try {
  TopDocs results = searcher.search(new MatchAllDocsQuery(), filter, maxNumDocs);

I was wondering which part of the chosen queries is wrong. I am looking for documents where each of the keywordSourceFields has some value AND the keyword field has no value. Please guide me through correcting the corresponding query. Best regards. -- A.Nazemian
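A likely culprit in both snippets is that "*" is treated as a literal term by TermRangeQuery/TermRangeFilter rather than as a wildcard, so the range only matches documents that literally contain the term "*". One way to express "every source field has a value AND the keyword field has no value", sketched here against the Lucene 4.x API with null (open-ended) range bounds, could look like this:

import java.io.IOException;
import java.util.List;

import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermRangeQuery;
import org.apache.lucene.search.TopDocs;

public class FieldExistsQueryExample {

  // Find documents where every source field has at least one term
  // and the keyword field has no term at all.
  public static TopDocs findDocsMissingKeywords(IndexSearcher searcher,
                                                List<String> keywordSourceFields,
                                                String keywordField,
                                                int maxNumDocs) throws IOException {
    BooleanQuery query = new BooleanQuery();
    for (String fieldName : keywordSourceFields) {
      // null bounds = open-ended range, i.e. "any term in this field"
      query.add(TermRangeQuery.newStringRange(fieldName, null, null, true, true), Occur.MUST);
    }
    // exclude documents that already have something in the keyword field
    query.add(TermRangeQuery.newStringRange(keywordField, null, null, true, true), Occur.MUST_NOT);
    return searcher.search(query, maxNumDocs);
  }
}

In Lucene 4.x, FieldValueFilter is another way to express "field has a value", though it relies on the field cache, so the open-ended range query above may be the simpler starting point.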
Lucene indexWriter update does not affect Solr search
I implemented some small code for the purpose of extracting keywords out of the Lucene index. I implemented it using a search component. My problem is that when I try to update the index through a Lucene IndexWriter, the Solr index that sits on top of it is not affected. As you can see, I did call commit.

BooleanQuery query = new BooleanQuery();
for (String fieldName : keywordSourceFields) {
  TermQuery termQuery = new TermQuery(new Term(fieldName,"N/A"));
  query.add(termQuery, Occur.MUST_NOT);
}
TermQuery termQuery = new TermQuery(new Term(keywordField, "N/A"));
query.add(termQuery, Occur.MUST);
try {
  //Query q = new QueryParser(keywordField, new StandardAnalyzer()).parse(query.toString());
  TopDocs results = searcher.search(query, maxNumDocs);
  ScoreDoc[] hits = results.scoreDocs;
  IndexWriter writer = getLuceneIndexWriter(searcher.getPath());
  for (int i = 0; i < hits.length; i++) {
    Document document = searcher.doc(hits[i].doc);
    List keywords = keyword.getKeywords(hits[i].doc);
    if (keywords.size() > 0) document.removeFields(keywordField);
    for (String word : keywords) {
      document.add(new StringField(keywordField, word, Field.Store.YES));
    }
    String uniqueKey = searcher.getSchema().getUniqueKeyField().getName();
    writer.updateDocument(new Term(uniqueKey, document.get(uniqueKey)), document);
  }
  writer.commit();
  writer.forceMerge(1);
  writer.close();
} catch (IOException | SyntaxError e) {
  throw new RuntimeException();
}

Please help me through solving this problem. -- A.Nazemian
Re: Lucene indexWriter update does not affect Solr search
Dear Upayavira, Hi, It is just the part of my code in which caused the problem. I know searchComponent is not for changing the index, but for the purpose of extracting document keywords I was forced to hack searchComponent for extracting keywords and putting them into index. For more information about why I chose searchComponent at the first place please follow this link: https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201503.mbox/browser Best regards. On Tue, Apr 7, 2015 at 5:30 PM, Upayavira wrote: > What are you trying to do? A search component is not intended for > updating the index, so it really doesn’t surprise me that you aren’t > seeing updates. > > I’d suggest you describe the problem you are trying to solve before > proposing solutions. > > Upayavira > > > On Tue, Apr 7, 2015, at 01:32 PM, Ali Nazemian wrote: > > I implement a small code for the purpose of extracting some keywords out > > of > > Lucene index. I did implement that using search component. My problem is > > when I tried to update Lucene IndexWriter, Solr index which is placed on > > top of that, does not affect. As you can see I did the commit part. > > > > BooleanQuery query = new BooleanQuery(); > > for (String fieldName : keywordSourceFields) { > > TermQuery termQuery = new TermQuery(new Term(fieldName,"N/A")); > > query.add(termQuery, Occur.MUST_NOT); > > } > > TermQuery termQuery=new TermQuery(new Term(keywordField, "N/A")); > > query.add(termQuery, Occur.MUST); > > try { > > //Query q= new QueryParser(keywordField, new > > StandardAnalyzer()).parse(query.toString()); > > TopDocs results = searcher.search(query, > > maxNumDocs); > > ScoreDoc[] hits = results.scoreDocs; > > IndexWriter writer = getLuceneIndexWriter(searcher.getPath()); > > for (int i = 0; i < hits.length; i++) { > > Document document = searcher.doc(hits[i].doc); > > List keywords = keyword.getKeywords(hits[i].doc); > > if(keywords.size()>0) document.removeFields(keywordField); > > for (String word : keywords) { > > document.add(new StringField(keywordField, word, > > Field.Store.YES)); > > } > > String uniqueKey = > > searcher.getSchema().getUniqueKeyField().getName(); > > writer.updateDocument(new Term(uniqueKey, > > document.get(uniqueKey)), > > document); > > } > > writer.commit(); > > writer.forceMerge(1); > > writer.close(); > > } catch (IOException | SyntaxError e) { > > throw new RuntimeException(); > > } > > > > Please help me through solving this problem. > > > > -- > > A.Nazemian > -- A.Nazemian
Re: Lucene indexWriter update does not affect Solr search
I did some investigation and found out that the retrieving part of documents works fine while Solr did not restarted. But the searching part of documents did not work. After I restarted Solr it seems that the core corrupted and failed to start! Here is the corresponding log: org.apache.solr.common.SolrException: Error opening new searcher at org.apache.solr.core.SolrCore.(SolrCore.java:896) at org.apache.solr.core.SolrCore.(SolrCore.java:662) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:513) at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:278) at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:272) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:722) Caused by: org.apache.solr.common.SolrException: Error opening new searcher at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1604) at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1716) at org.apache.solr.core.SolrCore.(SolrCore.java:868) ... 9 more Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in NRTCachingDirectory(MMapDirectory@C:\Users\Ali\workspace\lucene_solr_5_0_0\solr\server\solr\document\data\index lockFactory=org.apache.lucene.store.SimpleFSLockFactory@3bf76891; maxCacheMB=48.0 maxMergeSizeMB=4.0): files: [_2_Lucene50_0.doc, write.lock, _2_Lucene50_0.pos, _2.nvd, _2.fdt, _2_Lucene50_0.tim] at org.apache.lucene.index.IndexWriter.(IndexWriter.java:821) at org.apache.solr.update.SolrIndexWriter.(SolrIndexWriter.java:78) at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:65) at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:272) at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:115) at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1573) ... 11 more 4/7/2015, 6:53:26 PM ERROR SolrIndexWriter SolrIndexWriter was not closed prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!! 4/7/2015, 6:53:26 PM ERROR SolrIndexWriter Error closing IndexWriter java.lang.NullPointerException at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:2959) at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:2927) at org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:965) at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1010) at org.apache.solr.update.SolrIndexWriter.close(SolrIndexWriter.java:130) at org.apache.solr.update.SolrIndexWriter.finalize(SolrIndexWriter.java:183) at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method) at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:101) at java.lang.ref.Finalizer.access$100(Finalizer.java:32) at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:190) There for my guess would be problem with indexing the keywordField and also problem related to closing the IndexWriter. On Tue, Apr 7, 2015 at 6:13 PM, Ali Nazemian wrote: > Dear Upayavira, > Hi, > It is just the part of my code in which caused the problem. I know > searchComponent is not for changing the index, but for the purpose of > extracting document keywords I was forced to hack searchComponent for > extracting keywords and putting them into index. 
> For more information about why I chose searchComponent at the first place > please follow this link: > > https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201503.mbox/browser > > Best regards. > > > On Tue, Apr 7, 2015 at 5:30 PM, Upayavira wrote: > >> What are you trying to do? A search component is not intended for >> updating the index, so it really doesn’t surprise me that you aren’t >> seeing updates. >> >> I’d suggest you describe the problem you are trying to solve before >> proposing solutions. >> >> Upayavira >> >> >> On Tue, Apr 7, 2015, at 01:32 PM, Ali Nazemian wrote: >> > I implement a small code for the purpose of extracting some keywords out >> > of >> > Lucene index. I did implement that using search component. My problem is >> > when I tried to update Lucene IndexWriter, Solr index which is placed on >> > top of that, does not affect. As you can see I did the commit part. >> > >> > BooleanQuery query = new BooleanQuery(); >> > for (String fieldName : keywordSourceFields) { >> > TermQuery termQuery = new TermQuery(new >> Term(fieldName,"N/A")); >> > query
Lucene updateDocument does not affect index until restarting solr
Dear all, Hi, As a part of my code I have to update Lucene documents. For this purpose I used the writer.updateDocument() method. My problem is that the update does not affect the index until Solr is restarted. Would you please tell me which part of my code is wrong, or what I should add in order to apply the changes?

RefCounted iw = solrCoreState.getIndexWriter(core);
try {
  IndexWriter writer = iw.get();
  FieldType type = new FieldType(StringField.TYPE_STORED);
  for (int i = 0; i < hits.length; i++) {
    Document document = searcher.doc(hits[i].doc);
    List keywords = keyword.getKeywords(hits[i].doc);
    if (keywords.size() > 0) document.removeFields(keywordField);
    for (String word : keywords) {
      document.add(new Field(keywordField, word, type));
    }
    String uniqueKey = searcher.getSchema().getUniqueKeyField().getName();
    writer.updateDocument(new Term(uniqueKey, document.get(uniqueKey)), document);
  }
  writer.commit();
} finally {
  iw.decref();
}

Best regards. -- A.Nazemian
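The symptom described here (changes only visible after a restart) is consistent with Lucene committing the data while Solr keeps serving its already-open searcher. One way around it, sketched under the assumption that this code runs inside a component that has access to the SolrCore and a SolrQueryRequest, is to issue the commit through Solr's UpdateHandler, which also reopens the searcher:

import java.io.IOException;

import org.apache.solr.core.SolrCore;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.update.CommitUpdateCommand;

public class SolrCommitHelper {

  // Commit through Solr instead of calling IndexWriter.commit() directly,
  // so that Solr opens a new searcher and the updates become visible.
  public static void commitAndReopenSearcher(SolrCore core, SolrQueryRequest req) throws IOException {
    CommitUpdateCommand cmd = new CommitUpdateCommand(req, false); // false = no expungeDeletes
    cmd.openSearcher = true;   // make the changes visible to queries
    cmd.waitSearcher = true;   // block until the new searcher is registered
    core.getUpdateHandler().commit(cmd);
  }
}

Whether updating documents from a search component is a good idea at all is a separate question, discussed further down this thread, but without reopening the searcher the changes will stay invisible until a restart.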
Problem related to filter on Zero value for DateField
Dears, Hi, I have a strange problem with Solr 4.10.x. My problem is that when I search on the Solr zero date, which is "0002-11-30T00:00:00Z", the results become invalid if more than one filter is applied. For example, consider this scenario: when I search with fq=p_date:"0002-11-30T00:00:00Z", Solr returns three different documents, which is correct for my collection. All three documents have the same value of "7" for document status. If I search with fq=document_status:7, the same three documents are returned, which is also a correct response. But when I search with fq=document_status:7&fq=p_date:"0002-11-30T00:00:00Z", Solr returns nothing (0 documents), while I have no such problem with date values other than the Solr zero date ("0002-11-30T00:00:00Z"). Please let me know whether this is a bug in Solr or I did something wrong. Best regards. -- A.Nazemian
Re: Problem related to filter on Zero value for DateField
Dear Jack, Hi, The q parameter is *:* since I just wanted to filter the documents. Regards. On Tue, Apr 14, 2015 at 8:07 PM, Jack Krupansky wrote: > What does your main query look like? Normally we don't speak of "searching" > with the fq parameter - it filters the results, but the actual searching is > done via the main query with the q parameter. > > -- Jack Krupansky > > On Tue, Apr 14, 2015 at 4:17 AM, Ali Nazemian > wrote: > > > Dears, > > Hi, > > I have strange problem with Solr 4.10.x. My problem is when I do > searching > > on solr Zero date which is "0002-11-30T00:00:00Z" if more than one filter > > be considered, the results became invalid. For example consider this > > scenario: > > When I search for a document with fq=p_date:"0002-11-30T00:00:00Z" Solr > > returns three different documents which is right for my Collection. All > of > > these three documents have same value of "7" for document status. Now If > I > > search for fq=document_status:7 the same three documents returns which is > > also a correct response. But When I do the searching on > > fq=focument_status:7&fq=p_date:"0002-11-30T00:00:00Z", Solr returns > > nothing! (0 document) While I have not such problem with other date > values > > beside Solr Zero ("0002-11-30T00:00:00Z"). Please let me know it is a bug > > related to Solr or I did something wrong? > > Best regards. > > > > -- > > A.Nazemian > > > -- A.Nazemian
Re: Lucene updateDocument does not affect index until restarting solr
Dear Chris, Hi, Thank you for your response. Actually I implemented a small code for the purpose of extracting article keywords out of Lucene index on commit, optimize or calling the specific query. I did implement that using search component. I know that the searchComponent is not for the purpose of updating index, but it was suggested in Solr mailing list at the first place and it seems it is the most possible solution according to Solr extension points. Anyway for more information about why I chose searchComponent at the first place please take a look at this <https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201503.mbox/browser> link. Best regards. On Wed, Apr 15, 2015 at 10:00 PM, Chris Hostetter wrote: > > the short answer is that you need something to re-open the searcher -- but > i'm not going to go into specifics on how to do that because... > > You are dealing with a VERY low level layer of the lucene/solr code stack > -- w/o more details on why you've written this particular bit of code (and > where in the solr stack this code lives) it's hard to give you general > advice on the best way to proceed and i don't wnat to encourage you along > a dangerous path when there are likely much > easier/better/safer/more-supported ways to do what you are trying to do -- > you just need to explain to us what that is. > > https://people.apache.org/~hossman/#xyproblem > XY Problem > > Your question appears to be an "XY Problem" ... that is: you are dealing > with "X", you are assuming "Y" will help you, and you are asking about "Y" > without giving more details about the "X" so that we can understand the > full issue. Perhaps the best solution doesn't involve "Y" at all? > See Also: http://www.perlmonks.org/index.pl?node_id=542341 > > > > > : Date: Thu, 9 Apr 2015 01:02:16 +0430 > : From: Ali Nazemian > : Reply-To: solr-user@lucene.apache.org > : To: "solr-user@lucene.apache.org" > : Subject: Lucene updateDocument does not affect index until restarting > solr > : > : Dear all, > : Hi, > : As a part of my code I have to update Lucene document. For this purpose I > : used writer.updateDocument() method. My problem is the update process is > : not affect index until restarting Solr. Would you please tell me what > part > : of my code is wrong? Or what should I add in order to apply the changes? > : > : RefCounted iw = solrCoreState.getIndexWriter(core); > : try { > : IndexWriter writer = iw.get(); > : FieldType type= new FieldType(StringField.TYPE_STORED); > : for (int i = 0; i < hits.length; i++) { > : Document document = searcher.doc(hits[i].doc); > : List keywords = keyword.getKeywords(hits[i].doc); > : if (keywords.size() > 0) document.removeFields(keywordField); > : for (String word : keywords) { > : document.add(new Field(keywordField, word, type)); > : } > : String uniqueKey = > : searcher.getSchema().getUniqueKeyField().getName(); > : writer.updateDocument(new Term(uniqueKey, > : document.get(uniqueKey)), > : document); > : } > : writer.commit(); > : } finally { > : iw.decref(); > : } > : > : > : Best regards. > : > : -- > : A.Nazemian > : > > -Hoss > http://www.lucidworks.com/ > -- A.Nazemian
Date Format Conversion Function Query
Dear all, Hi, I was wondering whether there is any function query for converting date formats in Solr. If not, how can I implement such a function query myself? -- A.Nazemian
Re: Date Format Conversion Function Query
Dear Erick, Hi, Actually I want to convert dates from the Gregorian calendar (the Solr default) to the Persian calendar. You may ask why I do not do that on the client side? Here is why: I want to provide a way to extract data from Solr in CSV format. I know that Solr has a CSV ResponseWriter that could be used in this case. But my problem is that the dates in the Solr index are in the Gregorian calendar and I want to output them in the Persian calendar. Therefore I was thinking of a function query to do that at query time for me. Regards. On Tue, Jun 9, 2015 at 10:55 PM, Erick Erickson wrote: > I'm not sure what you're asking for, give us an example input/output pair? > > Best, > Erick > > On Tue, Jun 9, 2015 at 8:47 AM, Ali Nazemian > wrote: > > Dear all, > > Hi, > > I was wondering is there any function query for converting date format in > > Solr? If no, how can I implement such function query myself? > > > > -- > > A.Nazemian > -- A.Nazemian
Re: Date Format Conversion Function Query
Thank you very much. It seems that document transformer is the perfect extension point for this conversion. I will try to implement that. Best regards. On Wed, Jun 10, 2015 at 3:54 PM, Upayavira wrote: > Another technology that might make more sense is a Doc Transformer. > > You also specify them in the fl parameter. I would imagine you could > specify > > fl=id,[persian f=gregorian_Date] > > See here for more cases: > > > https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents > > This does not exist right now, but would make a good contribution to > Solr itself, I'd say. > > Upayavira > > On Wed, Jun 10, 2015, at 09:57 AM, Alessandro Benedetti wrote: > > Erick will correct me if I am wrong but this function query I don't think > > it exists. > > But maybe can be a nice contribution. > > It should take in input a date format and a field and give in response > > the > > new formatted Date. > > > > The would be simple to use it : > > > > fl=id,persian_date:dateFormat("/mm/dd",gregorian_Date) > > > > The date format is an example in input is an example. > > > > Cheers > > > > 2015-06-10 7:24 GMT+01:00 Ali Nazemian : > > > > > Dear Erick, > > > Hi, > > > Actually I want to convert date format from Geregorian calendar (solr > > > default) to Perisan calendar. You may ask why i do not do that at > client > > > side? Here is why: > > > > > > I want to provide a way to extract data from solr in the csv format. I > know > > > that solr has csv ResponseWriter that could be used in this case. But > my > > > problem is that the date format in solr index is provided by Geregorian > > > calendar and I want to put that in Persian calendar. Therefore I was > > > thinking of a function query to do that at query time for me. > > > > > > Regards. > > > > > > On Tue, Jun 9, 2015 at 10:55 PM, Erick Erickson < > erickerick...@gmail.com> > > > wrote: > > > > > > > I'm not sure what you're asking for, give us an example input/output > > > pair? > > > > > > > > Best, > > > > Erick > > > > > > > > On Tue, Jun 9, 2015 at 8:47 AM, Ali Nazemian > > > > wrote: > > > > > Dear all, > > > > > Hi, > > > > > I was wondering is there any function query for converting date > format > > > in > > > > > Solr? If no, how can I implement such function query myself? > > > > > > > > > > -- > > > > > A.Nazemian > > > > > > > > > > > > > > > > -- > > > A.Nazemian > > > > > > > > > > > -- > > -- > > > > Benedetti Alessandro > > Visiting card : http://about.me/alessandro_benedetti > > > > "Tyger, tyger burning bright > > In the forests of the night, > > What immortal hand or eye > > Could frame thy fearful symmetry?" > > > > William Blake - Songs of Experience -1794 England > -- A.Nazemian
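The DocTransformer route Upayavira describes would roughly take the following shape. This is only a skeleton: the class and parameter names are made up, the exact transform() signature differs between Solr versions, and the actual Gregorian-to-Persian conversion is left as a placeholder (e.g. via ICU4J or a hand-written algorithm):

import java.util.Date;

import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.transform.DocTransformer;
import org.apache.solr.response.transform.TransformerFactory;

// Would be registered in solrconfig.xml and used as fl=id,[persian f=gregorian_date]
public class PersianDateTransformerFactory extends TransformerFactory {

  @Override
  public DocTransformer create(String field, SolrParams params, SolrQueryRequest req) {
    final String sourceField = params.get("f", "gregorian_date");
    return new DocTransformer() {
      @Override
      public String getName() {
        return field;
      }

      @Override
      public void transform(SolrDocument doc, int docid) {
        Object value = doc.getFieldValue(sourceField);
        if (value instanceof Date) {
          doc.setField(field, toPersian((Date) value));
        }
      }
    };
  }

  // Placeholder: convert a Gregorian Date to a Persian (Jalali) calendar string.
  private static String toPersian(Date gregorian) {
    return gregorian.toInstant().toString();
  }
}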
Extracting article keywords using tf-idf algorithm
Dear Lucene/Solr developers, Hi, I decided to develop a plugin for Solr in order to extract the main keywords from an article. Since Solr has already done the hard work of calculating tf-idf scores, I decided to use that for the sake of better performance. I know that an UpdateRequestProcessor is the best-suited extension point for adding a keyword value to documents. I also found out that I do not have access to tf-idf scores inside the UpdateRequestProcessor, because the UpdateRequestProcessor chain is applied before tf-idf scores are calculated. Hence, after consulting with Solr/Lucene developers, I decided to go for a searchComponent in order to calculate keywords based on tf-idf (Lucene interesting terms) on commit/optimize. Unfortunately, with this approach I observed strange core behavior: for example, sometimes faceting won't work on this keyword field, or the index becomes unstable in search results. I would really appreciate it if someone could help me make it stable.

NamedList response = new SimpleOrderedMap();
keyword.init(searcher, params);
BooleanQuery query = new BooleanQuery();
for (String fieldName : keywordSourceFields) {
  TermQuery termQuery = new TermQuery(new Term(fieldName, "noval"));
  query.add(termQuery, Occur.MUST_NOT);
}
TermQuery termQuery = new TermQuery(new Term(keywordField, "noval"));
query.add(termQuery, Occur.MUST);
RefCounted iw = null;
IndexWriter writer = null;
try {
  TopDocs results = searcher.search(query, maxNumDocs);
  ScoreDoc[] hits = results.scoreDocs;
  iw = solrCoreState.getIndexWriter(core);
  writer = iw.get();
  FieldType type = new FieldType(StringField.TYPE_STORED);
  for (int i = 0; i < hits.length; i++) {
    Document document = searcher.doc(hits[i].doc);
    List keywords = keyword.getKeywords(hits[i].doc);
    if (keywords.size() > 0) document.removeFields(keywordField);
    for (String word : keywords) {
      document.add(new Field(keywordField, word, type));
    }
    String uniqueKey = searcher.getSchema().getUniqueKeyField().getName();
    writer.updateDocument(new Term(uniqueKey, document.get(uniqueKey)), document);
  }
  response.add("Number of Selected Docs", results.totalHits);
  writer.commit();
} catch (IOException | SyntaxError e) {
  throw new RuntimeException();
} finally {
  if (iw != null) {
    iw.decref();
  }
}

public List getKeywords(int docId) throws SyntaxError {
  String[] fields = new String[keywordSourceFields.size()];
  List terms = new ArrayList();
  fields = keywordSourceFields.toArray(fields);
  mlt.setFieldNames(fields);
  mlt.setAnalyzer(indexSearcher.getSchema().getIndexAnalyzer());
  mlt.setMinTermFreq(minTermFreq);
  mlt.setMinDocFreq(minDocFreq);
  mlt.setMinWordLen(minWordLen);
  mlt.setMaxQueryTerms(maxNumKeywords);
  mlt.setMaxNumTokensParsed(maxTokensParsed);
  try {
    terms = Arrays.asList(mlt.retrieveInterestingTerms(docId));
  } catch (IOException e) {
    LOGGER.error(e.getMessage());
    throw new RuntimeException();
  }
  return terms;
}

Best regards. -- A.Nazemian
Solr group query based on the sum aggregation of function query
Dear Solr users/developers, Hi, I have tried to implement the Page and Post relation in a single Solr schema. In my use case each page has multiple posts. The Page and Post fields are as follows: Post:{post_content, owner_page_id, document_type} Page:{page_id, document_type} Suppose I want to query this single core for results sorted by the total term frequency of a specific term per page. At first I thought the following query could satisfy this requirement for the term "hello": http://localhost:8983/solr/document/select?wt=json&indent=true&fl=id,name&q=*:*&group=true&group.field=owner_page_id&sort=termfreq(post_content,%27hello%27)+desc&fl=result:termfreq(post_content_text,%27hello%27),owner_page_id But it seems that this query returns the term frequency for a single post of each page; the result is not aggregated over all of the page's posts, and I am looking for the aggregated result. I would be really grateful if somebody could help me find the required query. P.S.: I am using Solr 6, so the JSON Facet API is available to me. -- A.Nazemian
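Since the JSON Facet API is available here, one way to get a per-page aggregate is a terms facet on owner_page_id with a sum(termfreq(...)) sub-aggregation, sorted descending by that sum. A rough SolrJ 6.x sketch, where the collection name, field names and limit are assumptions and sum() over termfreq() should be verified on your release:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PerPageTermFreq {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/document");
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(0);   // only the facet buckets are needed
    // One bucket per page, each carrying the summed term frequency of 'hello'
    // over all posts owned by that page, sorted by that sum.
    q.set("json.facet",
        "{pages:{type:terms, field:owner_page_id, limit:20, sort:\"total desc\","
      + " facet:{total:\"sum(termfreq(post_content,'hello'))\"}}}");
    QueryResponse rsp = client.query(q);
    System.out.println(rsp.getResponse().get("facets"));
    client.close();
  }
}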
Using SolrCloud with RDBMS or without
Hi everybody, I was wondering which scenario (or combination of scenarios) would be better for my application from the perspective of performance, scalability and high availability. Here is my application: suppose I am going to have more than 10m documents and the count grows every day (probably reaching more than 100m docs within a year). I want to use Solr as the tool for indexing these documents, but the problem is that I have some data fields that could change frequently (not too often, but they can change). Scenarios: 1- Using SolrCloud as the database for all data (even the data that could change). 2- Using SolrCloud as the database for static data and using an RDBMS (such as Oracle) for storing dynamic fields. 3- Using the integration of SolrCloud and Hadoop (HDFS+MapReduce) for all data. Best regards. -- A.Nazemian
Re: Using SolrCloud with RDBMS or without
The fact that I ignore Cassandra is because of it seems Cassandra is perfect when you have too much write operation. In my case it is true that I have some update operation but for sure read operations are much more than write ones. By the way there are probably more scenarios for my application. My question would be which one is probably the best? Best regards. On Mon, May 26, 2014 at 6:27 PM, Jack Krupansky wrote: > You could also consider DataStax Enterprise, which integrates Apache > Cassandra as the primary database and Solr for indexing and query. > > See: > http://www.datastax.com/what-we-offer/products-services/ > datastax-enterprise > > -- Jack Krupansky > > -----Original Message- From: Ali Nazemian > Sent: Monday, May 26, 2014 9:50 AM > To: solr-user@lucene.apache.org > Subject: Using SolrCloud with RDBMS or without > > > Hi everybody, > > I was wondering which scenario (or the combination) would be better for my > application. From the aspect of performance, scalability and high > availability. Here is my application: > > Suppose I am going to have more than 10m documents and it grows every day. > (probably in 1 years it reaches to more than 100m docs. I want to use Solr > as tool for indexing these documents but the problem is I have some data > fields that could change frequently. (not too much but it could change) > > Scenarios: > > 1- Using SolrCloud as database for all data. (even the one that could be > changed) > > 2- Using SolrCloud as database for static data and using RDBMS (such as > oracle) for storing dynamic fields. > > 3- Using The integration of SolrCloud and Hadoop (HDFS+MapReduce) for all > data. > > Best regards. > > -- > A.Nazemian > -- A.Nazemian
Re: Using SolrCloud with RDBMS or without
Dear Erick, Thank you for you reply. Some parts of documents come from Nutch crawler and the other parts come from processing those documents. I really need it to be as fast as possible and 10 hours for indexing is not acceptable for my application. Regards. On Mon, May 26, 2014 at 9:25 PM, Erick Erickson wrote: > What you haven't told us is where the data comes from. But until > you put some numbers to it, it's hard to decide. > > I tend to prefer storing the data somewhere else, filesystem, whatever > and indexing to Solr when data changes. Even if that means re-indexing > the entire corpus. I don't like going to more complicated solutions until > that proves untenable. > > Backup/restore solutions for filesystems, DBs, whatever are are a very > mature technology, I rely on that first to store my original source. > > Now you can re-index at will. > > So let's claim your data comes in from some stream somewhere. I'd > 1> store it to the file system. > 2> write a program to pull it off the file system and index. > 3> Your comment about MapReduceIndexerTool is germane. You can re-index > all that data very quickly. And it'll find files on your file system > for you too! > > But I wouldn't even go there until I'd tried > indexing my 10M docs straight with SolrJ or similar. If you can index > your 10M docs > in 1 hour and, by extrapolation your 100M docs in 10 hours, is that good > enough? > I don't know, it's your problem space after all ;). And is it acceptable > to not > see changes to the schema until tomorrow morning? If so, there's no need > to get > more complicated > > Best, > Erick > > On Mon, May 26, 2014 at 9:00 AM, Shawn Heisey wrote: > > On 5/26/2014 7:50 AM, Ali Nazemian wrote: > >> I was wondering which scenario (or the combination) would be better for > my > >> application. From the aspect of performance, scalability and high > >> availability. Here is my application: > >> > >> Suppose I am going to have more than 10m documents and it grows every > day. > >> (probably in 1 years it reaches to more than 100m docs. I want to use > Solr > >> as tool for indexing these documents but the problem is I have some data > >> fields that could change frequently. (not too much but it could change) > > > > Choosing which database software to use to hold your data is a problem > > with many possible solutions. Everyone will have a different answer for > > you. Each solution has strengths and weaknesses, and in the end, only > > you can really know what your requirements are. > > > >> Scenarios: > >> > >> 1- Using SolrCloud as database for all data. (even the one that could be > >> changed) > > > > If you choose to use Solr as a NoSQL, I would strongly recommend that > > you have two Solr installs. The first install would be purely for data > > storage and would have no indexed fields. If you can get machines with > > enough RAM, it would also probably be preferable to use a single index > > (or SolrCloud with one shard) for that install. The other install would > > be for searching. Sharding would not be an issue on that index. The > > reason that I make this recommendation is that when you use Solr for > > searching, you have to do a complete reindex if you change your search > > schema. It's difficult to reindex if the search index is also your > > canonical data source. > > > >> 2- Using SolrCloud as database for static data and using RDBMS (such as > >> oracle) for storing dynamic fields. > > > > I don't think it would be a good idea to have two canonical data > > sources. Pick one. 
As already mentioned, Solr is better as a search > > technology, serving up pointers to data in another data source, than as > > a database. > > > > If you want to use RDBMS technology, why would you spend all that money > > on Oracle? Just use one of the free databases. Our really large Solr > > index comes from a database. At one time that database was in Oracle. > > When my employer purchased the company with that database, we thought we > > were obtaining a full Oracle license. It turns out we weren't. It > > would have cost about half a million dollars to buy that license, so we > > switched to MySQL. > > > > Since making that move to MySQL, performance is actually *better*. The > > source table for our data has 96 million rows right now, growing at a > > rate of a few million per year. This is completely in line with your > > 100 million document requirement. F
Re: Using SolrCloud with RDBMS or without
Dear Shawn, Hi and thank you for you reply. Could you please tell me about the performance and scalability of the mentioned solutions? Suppose I have a SolrCloud with 4 different machine. Would it scale linearly if I add another 4 machines to that? I mean when the documents number increases from 10m to 100m documents. Regards. On Mon, May 26, 2014 at 8:30 PM, Shawn Heisey wrote: > On 5/26/2014 7:50 AM, Ali Nazemian wrote: > > I was wondering which scenario (or the combination) would be better for > my > > application. From the aspect of performance, scalability and high > > availability. Here is my application: > > > > Suppose I am going to have more than 10m documents and it grows every > day. > > (probably in 1 years it reaches to more than 100m docs. I want to use > Solr > > as tool for indexing these documents but the problem is I have some data > > fields that could change frequently. (not too much but it could change) > > Choosing which database software to use to hold your data is a problem > with many possible solutions. Everyone will have a different answer for > you. Each solution has strengths and weaknesses, and in the end, only > you can really know what your requirements are. > > > Scenarios: > > > > 1- Using SolrCloud as database for all data. (even the one that could be > > changed) > > If you choose to use Solr as a NoSQL, I would strongly recommend that > you have two Solr installs. The first install would be purely for data > storage and would have no indexed fields. If you can get machines with > enough RAM, it would also probably be preferable to use a single index > (or SolrCloud with one shard) for that install. The other install would > be for searching. Sharding would not be an issue on that index. The > reason that I make this recommendation is that when you use Solr for > searching, you have to do a complete reindex if you change your search > schema. It's difficult to reindex if the search index is also your > canonical data source. > > > 2- Using SolrCloud as database for static data and using RDBMS (such as > > oracle) for storing dynamic fields. > > I don't think it would be a good idea to have two canonical data > sources. Pick one. As already mentioned, Solr is better as a search > technology, serving up pointers to data in another data source, than as > a database. > > If you want to use RDBMS technology, why would you spend all that money > on Oracle? Just use one of the free databases. Our really large Solr > index comes from a database. At one time that database was in Oracle. > When my employer purchased the company with that database, we thought we > were obtaining a full Oracle license. It turns out we weren't. It > would have cost about half a million dollars to buy that license, so we > switched to MySQL. > > Since making that move to MySQL, performance is actually *better*. The > source table for our data has 96 million rows right now, growing at a > rate of a few million per year. This is completely in line with your > 100 million document requirement. For the massive table that feeds > Solr, we might switch to MongoDB, but that has not been decided yet. > > Later we switched from EasyAsk to Solr, a move that has *also* given us > better performance. Because both MySQL and Solr are free, we've > achieved a substantial cost savings. > > > 3- Using The integration of SolrCloud and Hadoop (HDFS+MapReduce) for all > > data. 
> > I have no experience with this technology, but I think that if you are > thinking about a database on HDFS, you're probably actually talking > about HBase, the Apache implementation of Google's BigTable. > > Thanks, > Shawn > > -- A.Nazemian
solr cross doc join on relational database
Hi everybody, I was wondering whether there is any way to use a cross-doc join across the integration of one Solr core and a relational database. Suppose I have a table named USER in a relational database (MySQL). I want to keep track of the news items each user can access. Assume the news is stored inside Solr and there is no easy way of transferring the USER table to Solr (because of the many changes that would be required in other parts of my application). So my question would be: is there any way of having a cross-doc join with one document inside Solr and the other one inside the RDBMS? Best regards. -- A.Nazemian
Re: solr cross doc join on relational database
Thank you very much. I will take a look at that. On Fri, May 30, 2014 at 4:24 PM, Ahmet Arslan wrote: > Hi Ali, > > I did a similar user filtering by indexing user table once per hour, and > filtering results by solr query time join query parser. > > Assuming there is no easy way to transfer USER table to solr, Solr post > filtering is the way to : > > http://searchhub.org/2012/02/22/custom-security-filtering-in-solr/ > > You can connect to your database in it, filter according to rights. ( can > this user see this document?) > > /** > * Note that this Query implementation can _only_ be used as an fq, not as > a q (it would need to implement createWeight). > */ > public class AreaIsOpenControlQuery extends ExtendedQueryBase implements > PostFilter { > > > > On Friday, May 30, 2014 2:26 PM, Ali Nazemian > wrote: > > > > Hi every body, > I was wondering is there any way for using cross doc join on integraion of > one solr core and a relational database. > Suppose I have a table in relational database (my sql) name USER. I want to > keep track of news that each user can have access. Assume news are stored > inside solr and there is no easy way of transferring USER table to solr > (because of so many changes that should be done inside other part of my > application) So my question would be is there any way of having cross doc > join with one document inside Solr and another one inside RDBMS? > Best regards. > -- > A.Nazemian > -- A.Nazemian
Document security filtering in distributed solr (with multi shard)
Dears, Hi, I am going to apply custom security filtering to each document per user (using a custom profile for each user). I was thinking of adding user fields to the index and using a Solr join for filtering, but it seems that for distributed Solr this is not a solution. Could you please tell me what the solution would be in this case? Best regards. -- A.Nazemian
Re: Document security filtering in distributed solr (with multi shard)
Dear Alexandre, Yeah, I saw that, but what is the best way of doing it from the performance point of view? I thought of one solution myself: suppose we have an RDBMS for users that contains the category and group of each user (it could be hierarchical). Suppose there is a field named "security" in the Solr index that contains the list of groups or categories allowed to see each document. The query would then filter only the documents whose category or group matches the ones for that user. Does this solution work in a distributed setup? What about performance? Also, I was wondering how Lucidworks does that? Best regards. On Tue, Jun 17, 2014 at 4:08 PM, Alexandre Rafalovitch wrote: > Have you looked at Post Filters? I think this was one of the use cases. > > An old article: > http://java.dzone.com/articles/custom-security-filtering-solr . Google > search should bring a couple more. > > Regards, >Alex. > Personal website: http://www.outerthoughts.com/ > Current project: http://www.solr-start.com/ - Accelerating your Solr > proficiency > > > On Tue, Jun 17, 2014 at 6:24 PM, Ali Nazemian > wrote: > > Dears, > > Hi, > > I am going to apply customer security filtering for each document per > each > > user. (using custom profile for each user). I was thinking of adding user > > fields to index and using solr join for filtering. But It seems for > > distributed solr this is not a solution. Could you please tell me what > the > > solution would be in this case? > > Best regards. > > > > -- > > A.Nazemian -- A.Nazemian
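For reference, the ACL-field idea sketched above boils down to attaching a filter query built from the user's groups to every request; because it is a plain fq against an indexed field, it works unchanged in distributed/SolrCloud setups and is cached like any other filter. A minimal SolrJ sketch, where the field name, group values and client setup are assumptions (in SolrJ 4.x the client class would be HttpSolrServer instead):

import java.util.Arrays;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class AclFilteredSearch {

  // Build a filter such as security:("group_a" OR "group_b") from the groups
  // loaded for the current user (e.g. from the user RDBMS). Group values are
  // assumed to be simple tokens that need no extra escaping.
  static String buildAclFilter(List<String> userGroups) {
    StringBuilder fq = new StringBuilder("security:(");
    for (int i = 0; i < userGroups.size(); i++) {
      if (i > 0) fq.append(" OR ");
      fq.append('"').append(userGroups.get(i)).append('"');
    }
    return fq.append(')').toString();
  }

  public static void main(String[] args) throws Exception {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/news");
    SolrQuery q = new SolrQuery("some user query");
    q.addFilterQuery(buildAclFilter(Arrays.asList("group_a", "group_b")));
    QueryResponse rsp = client.query(q);
    System.out.println(rsp.getResults().getNumFound());
    client.close();
  }
}

The post-filter approach from the article Alexandre links remains the alternative when the ACL logic is too dynamic to denormalize into an indexed field.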
Re: Document security filtering in distributed solr (with multi shard)
Any idea would be appropriate. On Tue, Jun 17, 2014 at 5:44 PM, Ali Nazemian wrote: > Dear Alexandre, > Yeah I saw that, but what is the best way of doing that from the > performance point of view? > I think of one solution myself: > Suppose we have a RDBMS for users that contains the category and group for > each user. (It could be in hierarchical format) Suppose there is a field > name "security" in solr index that contains the list of each group or > category that is applied to each document. So the query would be filter > only documents that its category or group match the specific one for that > user. > Is this solution works in distributed way? What if we concern about > performance? > Also I was wondering how lucidworks do that? > Best regards. > > > On Tue, Jun 17, 2014 at 4:08 PM, Alexandre Rafalovitch > wrote: > >> Have you looked at Post Filters? I think this was one of the use cases. >> >> An old article: >> http://java.dzone.com/articles/custom-security-filtering-solr . Google >> search should bring a couple more. >> >> Regards, >>Alex. >> Personal website: http://www.outerthoughts.com/ >> Current project: http://www.solr-start.com/ - Accelerating your Solr >> proficiency >> >> >> On Tue, Jun 17, 2014 at 6:24 PM, Ali Nazemian >> wrote: >> > Dears, >> > Hi, >> > I am going to apply customer security filtering for each document per >> each >> > user. (using custom profile for each user). I was thinking of adding >> user >> > fields to index and using solr join for filtering. But It seems for >> > distributed solr this is not a solution. Could you please tell me what >> the >> > solution would be in this case? >> > Best regards. >> > >> > -- >> > A.Nazemian >> > > > > -- > A.Nazemian > -- A.Nazemian
solr dedup on specific fields
Hi, I use Solr 4.8 for indexing the web pages that come from Nutch. I know that Solr's deduplication works on the uniqueKey field, so I set that to the URL field. Everything is OK, except that after duplicate detection I want Solr not to overwrite all fields of the old document; I want some fields to remain unchanged. For example, assume I have a field called "read" with the Boolean value "true" for a specific document. I want all fields of the new document to overwrite the old ones except the value of this field. Is that possible? How? Regards. -- A.Nazemian
Re: solr dedup on specific fields
Any suggestion would be appreciated. Regards. On Mon, Jun 30, 2014 at 2:49 PM, Ali Nazemian wrote: > Hi, > I used solr 4.8 for indexing the web pages that come from nutch. I know > that solr deduplication operation works on uniquekey field. So I set that > to URL field. Everything is OK. except that I want after duplication > detection solr try not to delete all fields of old document. I want some > fields remain unchanged. For example assume I have a data field called > "read" with Boolean value "true" for specific document. I want all fields > of new document overwrites except the value of this field. Is that > possible? How? > Regards. > > -- > A.Nazemian > -- A.Nazemian
Re: solr dedup on specific fields
Dears, Is there any way that I can do that in other way? I mean if you look at my main problem again you will find out that I have two types of fields in my documents. 1) The ones that should be overwritten on duplicates, 2) The ones that should not change during duplicates. So Is it another way to handle this situation from the first place? I mean using cross join for example? Assume I have a document with ID 2 which contains all the fields that can be overwritten. And another document with ID 2 which contains all fields that should not change during duplication detection. For selecting all fields it is enough to do join on ID and for Duplication it is enough to overwrite just document type 1. Regards. On Tue, Jul 1, 2014 at 6:17 PM, Alexandre Rafalovitch wrote: > Well, it's implemented in SignatureUpdateProcessorFactory. Worst case, > you can clone that code and add your preserve-field functionality. > Could even be a nice contribution. > > Regards, >Alex. > > Personal website: http://www.outerthoughts.com/ > Current project: http://www.solr-start.com/ - Accelerating your Solr > proficiency > > > On Tue, Jul 1, 2014 at 6:50 PM, Ali Nazemian > wrote: > > Any suggestion would be appreciated. > > Regards. > > > > > > On Mon, Jun 30, 2014 at 2:49 PM, Ali Nazemian > wrote: > > > >> Hi, > >> I used solr 4.8 for indexing the web pages that come from nutch. I know > >> that solr deduplication operation works on uniquekey field. So I set > that > >> to URL field. Everything is OK. except that I want after duplication > >> detection solr try not to delete all fields of old document. I want some > >> fields remain unchanged. For example assume I have a data field called > >> "read" with Boolean value "true" for specific document. I want all > fields > >> of new document overwrites except the value of this field. Is that > >> possible? How? > >> Regards. > >> > >> -- > >> A.Nazemian > >> > > > > > > > > -- > > A.Nazemian > -- A.Nazemian
Re: solr dedup on specific fields
Updating documents will add some extra time to indexing process. (I send the documents via apache Nutch) I prefer to make indexing as fast as possible. On Mon, Jul 7, 2014 at 12:05 PM, Alexandre Rafalovitch wrote: > Can you use Update operation instead of Create? Then, you can supply > only the fields that need to be changed and use atomic update to > preserve the others. But then you will have issues when you _are_ > creating new documents and you do need to store all fields. > > Regards, >Alex. > Personal website: http://www.outerthoughts.com/ > Current project: http://www.solr-start.com/ - Accelerating your Solr > proficiency > > > On Mon, Jul 7, 2014 at 2:08 PM, Ali Nazemian > wrote: > > Dears, > > Is there any way that I can do that in other way? > > I mean if you look at my main problem again you will find out that I have > > two types of fields in my documents. 1) The ones that should be > overwritten > > on duplicates, 2) The ones that should not change during duplicates. So > Is > > it another way to handle this situation from the first place? I mean > using > > cross join for example? > > Assume I have a document with ID 2 which contains all the fields that can > > be overwritten. And another document with ID 2 which contains all fields > > that should not change during duplication detection. For selecting all > > fields it is enough to do join on ID and for Duplication it is enough to > > overwrite just document type 1. > > Regards. > > > > > > On Tue, Jul 1, 2014 at 6:17 PM, Alexandre Rafalovitch < > arafa...@gmail.com> > > wrote: > > > >> Well, it's implemented in SignatureUpdateProcessorFactory. Worst case, > >> you can clone that code and add your preserve-field functionality. > >> Could even be a nice contribution. > >> > >> Regards, > >>Alex. > >> > >> Personal website: http://www.outerthoughts.com/ > >> Current project: http://www.solr-start.com/ - Accelerating your Solr > >> proficiency > >> > >> > >> On Tue, Jul 1, 2014 at 6:50 PM, Ali Nazemian > >> wrote: > >> > Any suggestion would be appreciated. > >> > Regards. > >> > > >> > > >> > On Mon, Jun 30, 2014 at 2:49 PM, Ali Nazemian > >> wrote: > >> > > >> >> Hi, > >> >> I used solr 4.8 for indexing the web pages that come from nutch. I > know > >> >> that solr deduplication operation works on uniquekey field. So I set > >> that > >> >> to URL field. Everything is OK. except that I want after duplication > >> >> detection solr try not to delete all fields of old document. I want > some > >> >> fields remain unchanged. For example assume I have a data field > called > >> >> "read" with Boolean value "true" for specific document. I want all > >> fields > >> >> of new document overwrites except the value of this field. Is that > >> >> possible? How? > >> >> Regards. > >> >> > >> >> -- > >> >> A.Nazemian > >> >> > >> > > >> > > >> > > >> > -- > >> > A.Nazemian > >> > > > > > > > > -- > > A.Nazemian > -- A.Nazemian
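For reference, a small sketch of the atomic-update idea suggested above, assuming a recent SolrJ client and illustrative field names (a Boolean read flag that must survive re-crawls). Only the fields that may change are sent with a "set" operation, so any field not mentioned keeps its old value; note that atomic updates require the affected fields to be stored.

```java
import java.util.Collections;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateExample {
    public static void main(String[] args) throws Exception {
        // Assumed core name; adjust to your setup.
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/webpages").build();

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "http://example.com/page-1");            // uniqueKey (the URL)
        // "set" replaces only these fields; the existing "read" flag is left untouched.
        doc.addField("content", Collections.singletonMap("set", "re-crawled page text"));
        doc.addField("fetch_date", Collections.singletonMap("set", "2014-07-07T00:00:00Z"));

        solr.add(doc);
        solr.commit();
        solr.close();
    }
}
```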
Re: solr dedup on specific fields
Dear Alexande, What if I use ExternalFileFiled for the fields that I dont want to be changed? Does that work for me? Regards. On Mon, Jul 7, 2014 at 2:05 PM, Alexandre Rafalovitch wrote: > Well, let us know when you figure out a way to satisfy all your > requirements. > > Solr is designed for a full-document replace to be efficient at it's > primary function (search). Any workaround require some sort of > sacrifice. > > Good luck, >Alex. > Personal website: http://www.outerthoughts.com/ > Current project: http://www.solr-start.com/ - Accelerating your Solr > proficiency > > > On Mon, Jul 7, 2014 at 4:32 PM, Ali Nazemian > wrote: > > Updating documents will add some extra time to indexing process. (I send > > the documents via apache Nutch) I prefer to make indexing as fast as > > possible. > > > > > > On Mon, Jul 7, 2014 at 12:05 PM, Alexandre Rafalovitch < > arafa...@gmail.com> > > wrote: > > > >> Can you use Update operation instead of Create? Then, you can supply > >> only the fields that need to be changed and use atomic update to > >> preserve the others. But then you will have issues when you _are_ > >> creating new documents and you do need to store all fields. > >> > >> Regards, > >>Alex. > >> Personal website: http://www.outerthoughts.com/ > >> Current project: http://www.solr-start.com/ - Accelerating your Solr > >> proficiency > >> > >> > >> On Mon, Jul 7, 2014 at 2:08 PM, Ali Nazemian > >> wrote: > >> > Dears, > >> > Is there any way that I can do that in other way? > >> > I mean if you look at my main problem again you will find out that I > have > >> > two types of fields in my documents. 1) The ones that should be > >> overwritten > >> > on duplicates, 2) The ones that should not change during duplicates. > So > >> Is > >> > it another way to handle this situation from the first place? I mean > >> using > >> > cross join for example? > >> > Assume I have a document with ID 2 which contains all the fields that > can > >> > be overwritten. And another document with ID 2 which contains all > fields > >> > that should not change during duplication detection. For selecting all > >> > fields it is enough to do join on ID and for Duplication it is enough > to > >> > overwrite just document type 1. > >> > Regards. > >> > > >> > > >> > On Tue, Jul 1, 2014 at 6:17 PM, Alexandre Rafalovitch < > >> arafa...@gmail.com> > >> > wrote: > >> > > >> >> Well, it's implemented in SignatureUpdateProcessorFactory. Worst > case, > >> >> you can clone that code and add your preserve-field functionality. > >> >> Could even be a nice contribution. > >> >> > >> >> Regards, > >> >>Alex. > >> >> > >> >> Personal website: http://www.outerthoughts.com/ > >> >> Current project: http://www.solr-start.com/ - Accelerating your Solr > >> >> proficiency > >> >> > >> >> > >> >> On Tue, Jul 1, 2014 at 6:50 PM, Ali Nazemian > >> >> wrote: > >> >> > Any suggestion would be appreciated. > >> >> > Regards. > >> >> > > >> >> > > >> >> > On Mon, Jun 30, 2014 at 2:49 PM, Ali Nazemian < > alinazem...@gmail.com> > >> >> wrote: > >> >> > > >> >> >> Hi, > >> >> >> I used solr 4.8 for indexing the web pages that come from nutch. I > >> know > >> >> >> that solr deduplication operation works on uniquekey field. So I > set > >> >> that > >> >> >> to URL field. Everything is OK. except that I want after > duplication > >> >> >> detection solr try not to delete all fields of old document. I > want > >> some > >> >> >> fields remain unchanged. 
For example assume I have a data field > >> called > >> >> >> "read" with Boolean value "true" for specific document. I want all > >> >> fields > >> >> >> of new document overwrites except the value of this field. Is that > >> >> >> possible? How? > >> >> >> Regards. > >> >> >> > >> >> >> -- > >> >> >> A.Nazemian > >> >> >> > >> >> > > >> >> > > >> >> > > >> >> > -- > >> >> > A.Nazemian > >> >> > >> > > >> > > >> > > >> > -- > >> > A.Nazemian > >> > > > > > > > > -- > > A.Nazemian > -- A.Nazemian
Re: solr dedup on specific fields
Yeah, unfortunately I want it to be searchable:( On Mon, Jul 7, 2014 at 2:23 PM, Alexandre Rafalovitch wrote: > It's an interesting thought. I haven't tried those. > > But I don't think the EFFs are searchable. Do you need them to be > searchable? > > Regards, >Alex. > Personal website: http://www.outerthoughts.com/ > Current project: http://www.solr-start.com/ - Accelerating your Solr > proficiency > > > On Mon, Jul 7, 2014 at 4:48 PM, Ali Nazemian > wrote: > > Dear Alexande, > > What if I use ExternalFileFiled for the fields that I dont want to be > > changed? Does that work for me? > > Regards. > > > > > > On Mon, Jul 7, 2014 at 2:05 PM, Alexandre Rafalovitch < > arafa...@gmail.com> > > wrote: > > > >> Well, let us know when you figure out a way to satisfy all your > >> requirements. > >> > >> Solr is designed for a full-document replace to be efficient at it's > >> primary function (search). Any workaround require some sort of > >> sacrifice. > >> > >> Good luck, > >>Alex. > >> Personal website: http://www.outerthoughts.com/ > >> Current project: http://www.solr-start.com/ - Accelerating your Solr > >> proficiency > >> > >> > >> On Mon, Jul 7, 2014 at 4:32 PM, Ali Nazemian > >> wrote: > >> > Updating documents will add some extra time to indexing process. (I > send > >> > the documents via apache Nutch) I prefer to make indexing as fast as > >> > possible. > >> > > >> > > >> > On Mon, Jul 7, 2014 at 12:05 PM, Alexandre Rafalovitch < > >> arafa...@gmail.com> > >> > wrote: > >> > > >> >> Can you use Update operation instead of Create? Then, you can supply > >> >> only the fields that need to be changed and use atomic update to > >> >> preserve the others. But then you will have issues when you _are_ > >> >> creating new documents and you do need to store all fields. > >> >> > >> >> Regards, > >> >>Alex. > >> >> Personal website: http://www.outerthoughts.com/ > >> >> Current project: http://www.solr-start.com/ - Accelerating your Solr > >> >> proficiency > >> >> > >> >> > >> >> On Mon, Jul 7, 2014 at 2:08 PM, Ali Nazemian > >> >> wrote: > >> >> > Dears, > >> >> > Is there any way that I can do that in other way? > >> >> > I mean if you look at my main problem again you will find out that > I > >> have > >> >> > two types of fields in my documents. 1) The ones that should be > >> >> overwritten > >> >> > on duplicates, 2) The ones that should not change during > duplicates. > >> So > >> >> Is > >> >> > it another way to handle this situation from the first place? I > mean > >> >> using > >> >> > cross join for example? > >> >> > Assume I have a document with ID 2 which contains all the fields > that > >> can > >> >> > be overwritten. And another document with ID 2 which contains all > >> fields > >> >> > that should not change during duplication detection. For selecting > all > >> >> > fields it is enough to do join on ID and for Duplication it is > enough > >> to > >> >> > overwrite just document type 1. > >> >> > Regards. > >> >> > > >> >> > > >> >> > On Tue, Jul 1, 2014 at 6:17 PM, Alexandre Rafalovitch < > >> >> arafa...@gmail.com> > >> >> > wrote: > >> >> > > >> >> >> Well, it's implemented in SignatureUpdateProcessorFactory. Worst > >> case, > >> >> >> you can clone that code and add your preserve-field functionality. > >> >> >> Could even be a nice contribution. > >> >> >> > >> >> >> Regards, > >> >> >>Alex. 
> >> >> >> > >> >> >> Personal website: http://www.outerthoughts.com/ > >> >> >> Current project: http://www.solr-start.com/ - Accelerating your > Solr > >> >> >> proficiency > >> >> >> > >> >> >> > >> >> >> On Tue, Jul 1, 2014 at 6:50 PM, Ali Nazemian < > alinazem...@gmail.com> > >> >> >> wrote: > >> >> >> > Any suggestion would be appreciated. > >> >> >> > Regards. > >> >> >> > > >> >> >> > > >> >> >> > On Mon, Jun 30, 2014 at 2:49 PM, Ali Nazemian < > >> alinazem...@gmail.com> > >> >> >> wrote: > >> >> >> > > >> >> >> >> Hi, > >> >> >> >> I used solr 4.8 for indexing the web pages that come from > nutch. I > >> >> know > >> >> >> >> that solr deduplication operation works on uniquekey field. So > I > >> set > >> >> >> that > >> >> >> >> to URL field. Everything is OK. except that I want after > >> duplication > >> >> >> >> detection solr try not to delete all fields of old document. I > >> want > >> >> some > >> >> >> >> fields remain unchanged. For example assume I have a data field > >> >> called > >> >> >> >> "read" with Boolean value "true" for specific document. I want > all > >> >> >> fields > >> >> >> >> of new document overwrites except the value of this field. Is > that > >> >> >> >> possible? How? > >> >> >> >> Regards. > >> >> >> >> > >> >> >> >> -- > >> >> >> >> A.Nazemian > >> >> >> >> > >> >> >> > > >> >> >> > > >> >> >> > > >> >> >> > -- > >> >> >> > A.Nazemian > >> >> >> > >> >> > > >> >> > > >> >> > > >> >> > -- > >> >> > A.Nazemian > >> >> > >> > > >> > > >> > > >> > -- > >> > A.Nazemian > >> > > > > > > > > -- > > A.Nazemian > -- A.Nazemian
Re: Need of hadoop
I think this will not improve indexing performance by itself, but it could be a way to get HDFS HA and replication for the index. I am not sure about that, though. On Mon, Jul 7, 2014 at 12:53 PM, search engn dev wrote: > Currently i am exploring hadoop with solr, Somewhere it is written as "This > does not use Hadoop Map-Reduce to process Solr data, rather it only uses > the > HDFS filesystem for index and transaction log file storage. " , > > then what is the advantage of using using hadoop over local file system? > will use of hdfs increase overall performance of searching? > > any detailed pointers regarding this will surely help me to understand > this. > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Need-of-hadoop-tp4145846.html > Sent from the Solr - User mailing list archive at Nabble.com. > -- A.Nazemian
Changing default behavior of solr for overwrite the whole document on uniquekey duplication
Dears, Hi, For my requirements I need to change Solr's default behavior of overwriting the whole document when the unique key is duplicated. I want the overwrite to affect only part of the document (some fields) while the other fields remain unchanged. First of all, is such a change in Solr's behavior possible? Second, I would really appreciate it if you could point me to the class or classes I should look at to change that. Best regards. -- A.Nazemian
Re: Changing default behavior of solr for overwrite the whole document on uniquekey duplication
Dear Himanshu, Hi, You misunderstood what I meant. I am not trying to update a particular field; I want to change what Solr does when the uniqueKey is duplicated. I do not want Solr to overwrite the whole document, only some parts of it. This is not something triggered from the user side; it is what Solr itself does with documents that share a duplicated uniqueKey. Regards. On Tue, Jul 8, 2014 at 12:29 PM, Himanshu Mehrotra < himanshu.mehro...@snapdeal.com> wrote: > Please look at https://wiki.apache.org/solr/Atomic_Updates > > This does what you want just update relevant fields. > > Thanks, > Himanshu > > > On Tue, Jul 8, 2014 at 1:09 PM, Ali Nazemian > wrote: > > > Dears, > > Hi, > > According to my requirement I need to change the default behavior of Solr > > for overwriting the whole document on unique-key duplication. I am going > to > > change that the overwrite just part of document (some fields) and other > > parts of document (other fields) remain unchanged. First of all I need to > > know such changing in Solr behavior is possible? Second, I really > > appreciate if you can guide me through what class/classes should I > consider > > for changing that? > > Best regards. > > > > -- > > A.Nazemian > > > -- A.Nazemian
Re: Changing default behavior of solr for overwrite the whole document on uniquekey duplication
Thank you very much. Now I understand what was the idea. It is better than changing Solr. But does performance remain same in this situation? On Tue, Jul 8, 2014 at 10:43 PM, Chris Hostetter wrote: > > I think you are missunderstanding what Himanshu is suggesting to you. > > You don't need to make lots of big changes ot the internals of solr's code > to get what you want -- instead you can leverage the Atomic Updates & > Optimistic Concurrency features of Solr to get the existing internal Solr > to reject any attempts to add a duplicate documentunless the client code > sending the document specifies it should be an "update". > > This means your client code needs to be a bit more sophisticated, but the > benefit is that you don't have to try to make complex changes to the > internals of Solr that may be impossible and/or difficult to > support/upgrade later. > > More details... > > > https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents#UpdatingPartsofDocuments-OptimisticConcurrency > > Simplest possible idea based on the basic info you have given so far... > > 1) send every doc using _version_=-1 > 2a) if doc update fails with error 409, that means a version of this doc > already exists > 2b) resend just the field changes (using "set" atomic > operation) and specify _version_=1 > > > > : Dear Himanshu, > : Hi, > : You misunderstood what I meant. I am not going to update some field. I am > : going to change what Solr do on duplication of uniquekey field. I dont > want > : to solr overwrite Whole document I just want to overwrite some parts of > : document. This situation does not come from user side this is what solr > do > : to documents with duplicated uniquekey. > : Regards. > : > : > : On Tue, Jul 8, 2014 at 12:29 PM, Himanshu Mehrotra < > : himanshu.mehro...@snapdeal.com> wrote: > : > : > Please look at https://wiki.apache.org/solr/Atomic_Updates > : > > : > This does what you want just update relevant fields. > : > > : > Thanks, > : > Himanshu > : > > : > > : > On Tue, Jul 8, 2014 at 1:09 PM, Ali Nazemian > : > wrote: > : > > : > > Dears, > : > > Hi, > : > > According to my requirement I need to change the default behavior of > Solr > : > > for overwriting the whole document on unique-key duplication. I am > going > : > to > : > > change that the overwrite just part of document (some fields) and > other > : > > parts of document (other fields) remain unchanged. First of all I > need to > : > > know such changing in Solr behavior is possible? Second, I really > : > > appreciate if you can guide me through what class/classes should I > : > consider > : > > for changing that? > : > > Best regards. > : > > > : > > -- > : > > A.Nazemian > : > > > : > > : > : > : > : -- > : A.Nazemian > : > > -Hoss > http://www.lucidworks.com/ > -- A.Nazemian
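A rough SolrJ sketch of the two-step recipe described above (the client class, core name, and field names are assumptions, not from the thread): first attempt the add with _version_=-1, which succeeds only if the document does not exist yet; if Solr answers with a 409 version conflict, resend just the overwritable fields as an atomic "set" update with _version_=1, which succeeds only if the document does exist.

```java
import java.util.Collections;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrInputDocument;

public class UpsertPreservingFields {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/webpages").build();
        String url = "http://example.com/page-1";   // uniqueKey value (illustrative)

        // Step 1: insert-only attempt. _version_ = -1 means "fail if the doc already exists".
        SolrInputDocument fresh = new SolrInputDocument();
        fresh.addField("id", url);
        fresh.addField("content", "full page text");
        fresh.addField("read", false);               // field that must never be overwritten later
        fresh.addField("_version_", -1L);
        try {
            solr.add(fresh);
        } catch (SolrException e) {
            if (e.code() != 409) throw e;            // 409 = version conflict, i.e. doc exists
            // Step 2: doc exists, so atomically update only the overwritable fields.
            SolrInputDocument partial = new SolrInputDocument();
            partial.addField("id", url);
            partial.addField("content", Collections.singletonMap("set", "full page text"));
            partial.addField("_version_", 1L);       // fail if the doc vanished in the meantime
            solr.add(partial);
        }
        solr.commit();
        solr.close();
    }
}
```

The performance cost is roughly one extra round trip for documents that already exist; new documents are indexed with a single request, exactly as before.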
integrating Accumulo with solr
Dear All, Hi, I was wondering whether anybody out there has tried to integrate Solr with Accumulo. I am thinking about running Accumulo on top of HDFS and using Solr to index the data inside Accumulo. Do you have any idea how I could do such an integration? Best regards. -- A.Nazemian
Re: integrating Accumulo with solr
Dear Joe, Hi, I am going to store the crawled web pages in Accumulo as the main storage layer of my project, and I need to hand that data to Solr for indexing and user searches. I also need to run some social and web analysis on the data, and I need some security features. That is why Accumulo is my choice for the database part, while I plan to use Solr for indexing and search. Would you please guide me through that? On Thu, Jul 24, 2014 at 1:28 AM, Joe Gresock wrote: > We store data in both Solr and Accumulo -- do you have more details about > what kind of data and indexing you want? Is there a reason you're thinking > of using both databases in particular? > > > On Wed, Jul 23, 2014 at 5:17 AM, Ali Nazemian > wrote: > > > Dear All, > > Hi, > > I was wondering is there anybody out there that tried to integrate Solr > > with Accumulo? I was thinking about using Accumulo on top of HDFS and > using > > Solr to index data inside Accumulo? Do you have any idea how can I do > such > > integration? > > > > Best regards. > > > > -- > > A.Nazemian > > > > > > -- > I know what it is to be in need, and I know what it is to have plenty. I > have learned the secret of being content in any and every situation, > whether well fed or hungry, whether living in plenty or in want. I can do > all this through him who gives me strength.*-Philippians 4:12-13* > -- A.Nazemian
Re: integrating Accumulo with solr
Thank you very much. Nice Idea but how can Solr and Accumulo can be synchronized in this way? I know that Solr can be integrated with HDFS and also Accumulo works on the top of HDFS. So can I use HDFS as integration point? I mean set Solr to use HDFS as a source of documents as well as the destination of documents. Regards. On Thu, Jul 24, 2014 at 4:33 PM, Joe Gresock wrote: > Ali, > > Sounds like a good choice. It's pretty standard to store the primary > storage id as a field in Solr so that you can search the full text in Solr > and then retrieve the full document elsewhere. > > I would recommend creating a document structure in Solr with whatever > fields you want indexed (most likely as text_en, etc.), and then store a > "string" field named "content_id", which would be the Accumulo row id that > you look up with a scan. > > One caveat -- Accumulo will be protected at the cell level, but if you need > your Solr search results to be protected by complex authorization strings > similar to Accumulo, you will need to write your own QParserPlugin and use > post filtering: > http://java.dzone.com/articles/custom-security-filtering-solr > > The code you see in that article is written for an earlier version of Solr, > but it's not too difficult to adjust it for the latest (we've done so in > our project). Once you've implemented this, you would store an > "authorizations" string field in each Solr document, and pass in the > authorizations that the user has access to in the fq parameter of every > query. It's also not too bad to write something that parses the Accumulo > authorizations string (like A&B&(C|D|E|F)) and interpret it accordingly in > the QParserPlugin. > > This will give you true row level security in Solr and Accumulo, and it > performs quite well in Solr. > > Let me know if you have any other questions. > > Joe > > > On Thu, Jul 24, 2014 at 4:07 AM, Ali Nazemian > wrote: > > > Dear Joe, > > Hi, > > I am going to store the crawl web pages in accumulo as the main storage > > part of my project and I need to give these data to solr for indexing and > > user searches. I need to do some social and web analysis on my data as > well > > as having some security features. Therefore accumulo is my choice for the > > database part and for index and search I am going to use Solr. Would you > > please guide me through that? > > > > > > > > On Thu, Jul 24, 2014 at 1:28 AM, Joe Gresock wrote: > > > > > We store data in both Solr and Accumulo -- do you have more details > about > > > what kind of data and indexing you want? Is there a reason you're > > thinking > > > of using both databases in particular? > > > > > > > > > On Wed, Jul 23, 2014 at 5:17 AM, Ali Nazemian > > > wrote: > > > > > > > Dear All, > > > > Hi, > > > > I was wondering is there anybody out there that tried to integrate > Solr > > > > with Accumulo? I was thinking about using Accumulo on top of HDFS and > > > using > > > > Solr to index data inside Accumulo? Do you have any idea how can I do > > > such > > > > integration? > > > > > > > > Best regards. > > > > > > > > -- > > > > A.Nazemian > > > > > > > > > > > > > > > > -- > > > I know what it is to be in need, and I know what it is to have plenty. > I > > > have learned the secret of being content in any and every situation, > > > whether well fed or hungry, whether living in plenty or in want. 
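A minimal sketch of the Solr side of the pattern described above, with assumed field names (content_id, authorizations) and an assumed core; the Accumulo scan itself is left as a placeholder. Real enforcement of the authorizations string would still go through the custom QParserPlugin post filter mentioned in the thread, not the plain stored field shown here.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class AccumuloBackedSearch {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/webpages").build();

        // Index: searchable fields plus the Accumulo row id and the visibility expression.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "page-0001");
        doc.addField("content_en", "full text extracted from the crawled page");
        doc.addField("content_id", "row_0001");          // Accumulo row id, stored string field
        doc.addField("authorizations", "A&B&(C|D|E|F)"); // Accumulo-style visibility string
        solr.add(doc);
        solr.commit();

        // Search: full-text query in Solr, then fetch each hit from Accumulo by content_id.
        SolrQuery query = new SolrQuery("content_en:crawled");
        for (SolrDocument hit : solr.query(query).getResults()) {
            String rowId = (String) hit.getFieldValue("content_id");
            // An Accumulo scanner lookup of rowId would go here.
            System.out.println("fetch Accumulo row: " + rowId);
        }
        solr.close();
    }
}
```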
I can > > do > > > all this through him who gives me strength.*-Philippians 4:12-13* > > > > > > > > > > > -- > > A.Nazemian > > > > > > -- > I know what it is to be in need, and I know what it is to have plenty. I > have learned the secret of being content in any and every situation, > whether well fed or hungry, whether living in plenty or in want. I can do > all this through him who gives me strength.*-Philippians 4:12-13* > -- A.Nazemian
Re: integrating Accumulo with solr
Dear Jack, Thank you. I am aware of datastax but I am looking for integrating accumulo with solr. This is something like what sqrrl guys offer. Regards. On Thu, Jul 24, 2014 at 7:27 PM, Jack Krupansky wrote: > If you are not a "true hard-core gunslinger" who is willing to dive in and > integrate the code yourself, instead you should give serious consideration > to a product such as DataStax Enterprise that fully integrates and packages > a NoSQL database (Cassandra) and Solr for search. The security aspects are > still a work in progress, but certainly headed in the right direction. And > it has Hadoop and Spark integration as well. > > See: > http://www.datastax.com/what-we-offer/products-services/ > datastax-enterprise > > -- Jack Krupansky > > -Original Message- From: Ali Nazemian > Sent: Thursday, July 24, 2014 10:30 AM > To: solr-user@lucene.apache.org > Subject: Re: integrating Accumulo with solr > > > Thank you very much. Nice Idea but how can Solr and Accumulo can be > synchronized in this way? > I know that Solr can be integrated with HDFS and also Accumulo works on the > top of HDFS. So can I use HDFS as integration point? I mean set Solr to use > HDFS as a source of documents as well as the destination of documents. > Regards. > > > On Thu, Jul 24, 2014 at 4:33 PM, Joe Gresock wrote: > > Ali, >> >> Sounds like a good choice. It's pretty standard to store the primary >> storage id as a field in Solr so that you can search the full text in Solr >> and then retrieve the full document elsewhere. >> >> I would recommend creating a document structure in Solr with whatever >> fields you want indexed (most likely as text_en, etc.), and then store a >> "string" field named "content_id", which would be the Accumulo row id that >> you look up with a scan. >> >> One caveat -- Accumulo will be protected at the cell level, but if you >> need >> your Solr search results to be protected by complex authorization strings >> similar to Accumulo, you will need to write your own QParserPlugin and use >> post filtering: >> http://java.dzone.com/articles/custom-security-filtering-solr >> >> The code you see in that article is written for an earlier version of >> Solr, >> but it's not too difficult to adjust it for the latest (we've done so in >> our project). Once you've implemented this, you would store an >> "authorizations" string field in each Solr document, and pass in the >> authorizations that the user has access to in the fq parameter of every >> query. It's also not too bad to write something that parses the Accumulo >> authorizations string (like A&B&(C|D|E|F)) and interpret it accordingly in >> the QParserPlugin. >> >> This will give you true row level security in Solr and Accumulo, and it >> performs quite well in Solr. >> >> Let me know if you have any other questions. >> >> Joe >> >> >> On Thu, Jul 24, 2014 at 4:07 AM, Ali Nazemian >> wrote: >> >> > Dear Joe, >> > Hi, >> > I am going to store the crawl web pages in accumulo as the main storage >> > part of my project and I need to give these data to solr for indexing > >> and >> > user searches. I need to do some social and web analysis on my data as >> well >> > as having some security features. Therefore accumulo is my choice for > >> the >> > database part and for index and search I am going to use Solr. Would you >> > please guide me through that? 
>> > >> > >> > >> > On Thu, Jul 24, 2014 at 1:28 AM, Joe Gresock >> wrote: >> > >> > > We store data in both Solr and Accumulo -- do you have more details >> about >> > > what kind of data and indexing you want? Is there a reason you're >> > thinking >> > > of using both databases in particular? >> > > >> > > >> > > On Wed, Jul 23, 2014 at 5:17 AM, Ali Nazemian >> > > wrote: >> > > >> > > > Dear All, >> > > > Hi, >> > > > I was wondering is there anybody out there that tried to integrate >> Solr >> > > > with Accumulo? I was thinking about using Accumulo on top of HDFS > >> > > and >> > > using >> > > > Solr to index data inside Accumulo? Do you have any idea how can I >> > > > do >> > > such >> > > > integration? >> > > > >> > > > Best regards. >> > > > >> > > > -- >> > > > A.Nazemian >> > > > >> > > >> > > >> > > >> > > -- >> > > I know what it is to be in need, and I know what it is to have plenty. >> I >> > > have learned the secret of being content in any and every situation, >> > > whether well fed or hungry, whether living in plenty or in want. I > >> > can >> > do >> > > all this through him who gives me strength.*-Philippians 4:12-13* >> > > >> > >> > >> > >> > -- >> > A.Nazemian >> > >> >> >> >> -- >> I know what it is to be in need, and I know what it is to have plenty. I >> have learned the secret of being content in any and every situation, >> whether well fed or hungry, whether living in plenty or in want. I can do >> all this through him who gives me strength.*-Philippians 4:12-13* >> >> > > > -- > A.Nazemian > -- A.Nazemian
Re: integrating Accumulo with solr
Dear Jack, Actually I am going to do benefit-cost analysis for in-house developement or going for sqrrl support. Best regards. On Thu, Jul 24, 2014 at 11:48 PM, Jack Krupansky wrote: > Like I said, you're going to have to be a real, hard-core gunslinger to do > that well. Sqrrl uses Lucene directly, BTW: > > "Full-Text Search: Utilizing open-source Lucene and custom indexing > methods, Sqrrl Enterprise users can conduct real-time, full-text search > across data in Sqrrl Enterprise." > > See: > http://sqrrl.com/product/search/ > > Out of curiosity, why are you not using that integrated Lucene support of > Sqrrl Enterprise? > > > -- Jack Krupansky > > -Original Message- From: Ali Nazemian > Sent: Thursday, July 24, 2014 3:07 PM > > To: solr-user@lucene.apache.org > Subject: Re: integrating Accumulo with solr > > Dear Jack, > Thank you. I am aware of datastax but I am looking for integrating accumulo > with solr. This is something like what sqrrl guys offer. > Regards. > > > On Thu, Jul 24, 2014 at 7:27 PM, Jack Krupansky > wrote: > > If you are not a "true hard-core gunslinger" who is willing to dive in and >> integrate the code yourself, instead you should give serious consideration >> to a product such as DataStax Enterprise that fully integrates and >> packages >> a NoSQL database (Cassandra) and Solr for search. The security aspects are >> still a work in progress, but certainly headed in the right direction. And >> it has Hadoop and Spark integration as well. >> >> See: >> http://www.datastax.com/what-we-offer/products-services/ >> datastax-enterprise >> >> -- Jack Krupansky >> >> -Original Message- From: Ali Nazemian >> Sent: Thursday, July 24, 2014 10:30 AM >> To: solr-user@lucene.apache.org >> Subject: Re: integrating Accumulo with solr >> >> >> Thank you very much. Nice Idea but how can Solr and Accumulo can be >> synchronized in this way? >> I know that Solr can be integrated with HDFS and also Accumulo works on >> the >> top of HDFS. So can I use HDFS as integration point? I mean set Solr to >> use >> HDFS as a source of documents as well as the destination of documents. >> Regards. >> >> >> On Thu, Jul 24, 2014 at 4:33 PM, Joe Gresock wrote: >> >> Ali, >> >>> >>> Sounds like a good choice. It's pretty standard to store the primary >>> storage id as a field in Solr so that you can search the full text in >>> Solr >>> and then retrieve the full document elsewhere. >>> >>> I would recommend creating a document structure in Solr with whatever >>> fields you want indexed (most likely as text_en, etc.), and then store a >>> "string" field named "content_id", which would be the Accumulo row id >>> that >>> you look up with a scan. >>> >>> One caveat -- Accumulo will be protected at the cell level, but if you >>> need >>> your Solr search results to be protected by complex authorization strings >>> similar to Accumulo, you will need to write your own QParserPlugin and >>> use >>> post filtering: >>> http://java.dzone.com/articles/custom-security-filtering-solr >>> >>> The code you see in that article is written for an earlier version of >>> Solr, >>> but it's not too difficult to adjust it for the latest (we've done so in >>> our project). Once you've implemented this, you would store an >>> "authorizations" string field in each Solr document, and pass in the >>> authorizations that the user has access to in the fq parameter of every >>> query. 
It's also not too bad to write something that parses the Accumulo >>> authorizations string (like A&B&(C|D|E|F)) and interpret it accordingly >>> in >>> the QParserPlugin. >>> >>> This will give you true row level security in Solr and Accumulo, and it >>> performs quite well in Solr. >>> >>> Let me know if you have any other questions. >>> >>> Joe >>> >>> >>> On Thu, Jul 24, 2014 at 4:07 AM, Ali Nazemian >>> wrote: >>> >>> > Dear Joe, >>> > Hi, >>> > I am going to store the crawl web pages in accumulo as the main storage >>> > part of my project and I need to give these data to solr for indexing > >>> and >>> > user searches. I need to do some social and web analysis on my data as >>> well >>> > as
Re: integrating Accumulo with solr
Dear Jack, Hi, One more thing to mention: I dont want to use solr or lucence for indexing accumulo or full text search inside that. I am looking for have both in a sync mode. I mean import some parts of data to solr for indexing. For this purpose probably I need something like trigger in RDBMS, I have to define something (probably with accumulo iterator) to import to solr on inserting new data. Regards. On Fri, Jul 25, 2014 at 12:59 PM, Ali Nazemian wrote: > Dear Jack, > Actually I am going to do benefit-cost analysis for in-house developement > or going for sqrrl support. > Best regards. > > > On Thu, Jul 24, 2014 at 11:48 PM, Jack Krupansky > wrote: > >> Like I said, you're going to have to be a real, hard-core gunslinger to >> do that well. Sqrrl uses Lucene directly, BTW: >> >> "Full-Text Search: Utilizing open-source Lucene and custom indexing >> methods, Sqrrl Enterprise users can conduct real-time, full-text search >> across data in Sqrrl Enterprise." >> >> See: >> http://sqrrl.com/product/search/ >> >> Out of curiosity, why are you not using that integrated Lucene support of >> Sqrrl Enterprise? >> >> >> -- Jack Krupansky >> >> -Original Message- From: Ali Nazemian >> Sent: Thursday, July 24, 2014 3:07 PM >> >> To: solr-user@lucene.apache.org >> Subject: Re: integrating Accumulo with solr >> >> Dear Jack, >> Thank you. I am aware of datastax but I am looking for integrating >> accumulo >> with solr. This is something like what sqrrl guys offer. >> Regards. >> >> >> On Thu, Jul 24, 2014 at 7:27 PM, Jack Krupansky >> wrote: >> >> If you are not a "true hard-core gunslinger" who is willing to dive in >>> and >>> integrate the code yourself, instead you should give serious >>> consideration >>> to a product such as DataStax Enterprise that fully integrates and >>> packages >>> a NoSQL database (Cassandra) and Solr for search. The security aspects >>> are >>> still a work in progress, but certainly headed in the right direction. >>> And >>> it has Hadoop and Spark integration as well. >>> >>> See: >>> http://www.datastax.com/what-we-offer/products-services/ >>> datastax-enterprise >>> >>> -- Jack Krupansky >>> >>> -Original Message- From: Ali Nazemian >>> Sent: Thursday, July 24, 2014 10:30 AM >>> To: solr-user@lucene.apache.org >>> Subject: Re: integrating Accumulo with solr >>> >>> >>> Thank you very much. Nice Idea but how can Solr and Accumulo can be >>> synchronized in this way? >>> I know that Solr can be integrated with HDFS and also Accumulo works on >>> the >>> top of HDFS. So can I use HDFS as integration point? I mean set Solr to >>> use >>> HDFS as a source of documents as well as the destination of documents. >>> Regards. >>> >>> >>> On Thu, Jul 24, 2014 at 4:33 PM, Joe Gresock wrote: >>> >>> Ali, >>> >>>> >>>> Sounds like a good choice. It's pretty standard to store the primary >>>> storage id as a field in Solr so that you can search the full text in >>>> Solr >>>> and then retrieve the full document elsewhere. >>>> >>>> I would recommend creating a document structure in Solr with whatever >>>> fields you want indexed (most likely as text_en, etc.), and then store a >>>> "string" field named "content_id", which would be the Accumulo row id >>>> that >>>> you look up with a scan. 
>>>> >>>> One caveat -- Accumulo will be protected at the cell level, but if you >>>> need >>>> your Solr search results to be protected by complex authorization >>>> strings >>>> similar to Accumulo, you will need to write your own QParserPlugin and >>>> use >>>> post filtering: >>>> http://java.dzone.com/articles/custom-security-filtering-solr >>>> >>>> The code you see in that article is written for an earlier version of >>>> Solr, >>>> but it's not too difficult to adjust it for the latest (we've done so in >>>> our project). Once you've implemented this, you would store an >>>> "authorizations" string field in each Solr document, and pass in the >>>> authorizations that the user has access to in the fq parameter of every >>
Re: integrating Accumulo with solr
Sure, Thank you very much for your guide. I think I am not that kind of gunslinger and probably I will go for another NoSQL that can be integrated with solr/elastic search much easier:) Best regards. On Sun, Jul 27, 2014 at 5:02 PM, Jack Krupansky wrote: > Right, and that's exactly what DataStax Enterprise provides (at great > engineering effort!) - synchronization of database updates and search > indexing. Sure, you can do it as well, but that's a significant engineering > challenge with both sides of the equation, and not a simple "plug and play" > configuration setting by writing a simple "connector." > > But, hey, if you consider yourself one of those "true hard-core > gunslingers" then you'll be able to code that up in a weekend without any > of our assistance, right? > > In short, synchronizing two data stores is a real challenge. Yes, it is > doable, but... it is non-trivial. Especially if both stores are distributed > clusters. Maybe now you can guess why the Sqrrl guys went the Lucene route > instead of Solr. > > I'm certainly not suggesting that it can't be done. Just highlighting the > challenge of such a task. > > Just to be clear, you are referring to "sync mode" and not mere "ETL", > which people do all the time with batch scripts, Java extraction and > ingestion connectors, and cron jobs. > > Give it a shot and let us know how it works out. > > > -- Jack Krupansky > > -Original Message- From: Ali Nazemian > Sent: Sunday, July 27, 2014 1:20 AM > > To: solr-user@lucene.apache.org > Subject: Re: integrating Accumulo with solr > > Dear Jack, > Hi, > One more thing to mention: I dont want to use solr or lucence for indexing > accumulo or full text search inside that. I am looking for have both in a > sync mode. I mean import some parts of data to solr for indexing. For this > purpose probably I need something like trigger in RDBMS, I have to define > something (probably with accumulo iterator) to import to solr on inserting > new data. > Regards. > > On Fri, Jul 25, 2014 at 12:59 PM, Ali Nazemian > wrote: > > Dear Jack, >> Actually I am going to do benefit-cost analysis for in-house developement >> or going for sqrrl support. >> Best regards. >> >> >> On Thu, Jul 24, 2014 at 11:48 PM, Jack Krupansky > > >> wrote: >> >> Like I said, you're going to have to be a real, hard-core gunslinger to >>> do that well. Sqrrl uses Lucene directly, BTW: >>> >>> "Full-Text Search: Utilizing open-source Lucene and custom indexing >>> methods, Sqrrl Enterprise users can conduct real-time, full-text search >>> across data in Sqrrl Enterprise." >>> >>> See: >>> http://sqrrl.com/product/search/ >>> >>> Out of curiosity, why are you not using that integrated Lucene support of >>> Sqrrl Enterprise? >>> >>> >>> -- Jack Krupansky >>> >>> -Original Message- From: Ali Nazemian >>> Sent: Thursday, July 24, 2014 3:07 PM >>> >>> To: solr-user@lucene.apache.org >>> Subject: Re: integrating Accumulo with solr >>> >>> Dear Jack, >>> Thank you. I am aware of datastax but I am looking for integrating >>> accumulo >>> with solr. This is something like what sqrrl guys offer. >>> Regards. >>> >>> >>> On Thu, Jul 24, 2014 at 7:27 PM, Jack Krupansky >> > >>> wrote: >>> >>> If you are not a "true hard-core gunslinger" who is willing to dive in >>> >>>> and >>>> integrate the code yourself, instead you should give serious >>>> consideration >>>> to a product such as DataStax Enterprise that fully integrates and >>>> packages >>>> a NoSQL database (Cassandra) and Solr for search. 
The security aspects >>>> are >>>> still a work in progress, but certainly headed in the right direction. >>>> And >>>> it has Hadoop and Spark integration as well. >>>> >>>> See: >>>> http://www.datastax.com/what-we-offer/products-services/ >>>> datastax-enterprise >>>> >>>> -- Jack Krupansky >>>> >>>> -Original Message- From: Ali Nazemian >>>> Sent: Thursday, July 24, 2014 10:30 AM >>>> To: solr-user@lucene.apache.org >>>> Subject: Re: integrating Accumulo with solr >>>> >>>> >>>> Thank you very much. Nice Idea but how can Solr and Accumulo can
solr over hdfs for accessing/ changing indexes outside solr
Dear all, Hi, I configured Solr 4.9 to write its index and data on HDFS. Now I would like to access that data from outside Solr in order to change some of the values. Could somebody please tell me how that is possible? Suppose I am using HBase over HDFS to make these changes. Best regards. -- A.Nazemian
Re: solr over hdfs for accessing/ changing indexes outside solr
Actually, I am going to run some analysis on the Solr data using MapReduce. For that purpose I may need to change parts of the data or add new fields from outside Solr. On Tue, Aug 5, 2014 at 5:51 PM, Shawn Heisey wrote: > On 8/5/2014 7:04 AM, Ali Nazemian wrote: > > I changed solr 4.9 to write index and data on hdfs. Now I am going to > > connect to those data from the outside of solr for changing some of the > > values. Could somebody please tell me how that is possible? Suppose I am > > using Hbase over hdfs for do these changes. > > I don't know how you could safely modify the index without a Lucene > application or another instance of Solr, but if you do manage to modify > the index, simply reloading the core or restarting Solr should cause it > to pick up the changes. Either you would need to make sure that Solr > never modifies the index, or you would need some way of coordinating > updates so that Solr and the other application would never try to modify > the index at the same time. > > Thanks, > Shawn > > -- A.Nazemian
Re: solr over hdfs for accessing/ changing indexes outside solr
Dear Erick, Hi, Thank you for you reply. Yeah I am aware that SolrJ is my last option. I was thinking about raw I/O operation. So according to your reply probably it is not applicable somehow. What about the Lily project that Michael mentioned? Is that consider SolrJ too? Are you aware of Cloudera search? I know they provide an integrated Hadoop ecosystem. Do you know what is their suggestion? Best regards. On Wed, Aug 6, 2014 at 12:28 AM, Erick Erickson wrote: > What you haven't told us is what you mean by "modify the > index outside Solr". SolrJ? Using raw Lucene? Trying to modify > things by writing your own codec? Standard Java I/O operations? > Other? > > You could use SolrJ to connect to an existing Solr server and > both read and modify at will form your M/R jobs. But if you're > thinking of trying to write/modify the segment files by raw I/O > operations, good luck! I'm 99.99% certain that's going to cause > you endless grief. > > Best, > Erick > > > On Tue, Aug 5, 2014 at 9:55 AM, Ali Nazemian > wrote: > > > Actually I am going to do some analysis on the solr data using map > reduce. > > For this purpose it might be needed to change some part of data or add > new > > fields from outside solr. > > > > > > On Tue, Aug 5, 2014 at 5:51 PM, Shawn Heisey wrote: > > > > > On 8/5/2014 7:04 AM, Ali Nazemian wrote: > > > > I changed solr 4.9 to write index and data on hdfs. Now I am going to > > > > connect to those data from the outside of solr for changing some of > the > > > > values. Could somebody please tell me how that is possible? Suppose I > > am > > > > using Hbase over hdfs for do these changes. > > > > > > I don't know how you could safely modify the index without a Lucene > > > application or another instance of Solr, but if you do manage to modify > > > the index, simply reloading the core or restarting Solr should cause it > > > to pick up the changes. Either you would need to make sure that Solr > > > never modifies the index, or you would need some way of coordinating > > > updates so that Solr and the other application would never try to > modify > > > the index at the same time. > > > > > > Thanks, > > > Shawn > > > > > > > > > > > > -- > > A.Nazemian > > > -- A.Nazemian
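As one concrete illustration of the SolrJ route mentioned above (rather than raw I/O on the index files): an external analysis job can read documents from a live Solr node and write computed fields back with atomic updates, so Lucene remains the only thing that ever rewrites segments. Field names and the scoring step below are placeholders; for SolrCloud the only change would be using CloudSolrClient pointed at the ZooKeeper ensemble.

```java
import java.util.Collections;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class EnrichmentJob {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/webpages").build();

        // Pull a page of documents that have not been scored yet (illustrative field name).
        SolrQuery query = new SolrQuery("*:* -analysis_score:[* TO *]");
        query.setRows(100);

        for (SolrDocument hit : solr.query(query).getResults()) {
            String id = (String) hit.getFieldValue("id");
            double score = computeScore(hit);    // placeholder for the real analysis step

            // Write the result back as an atomic update instead of modifying index files.
            SolrInputDocument update = new SolrInputDocument();
            update.addField("id", id);
            update.addField("analysis_score", Collections.singletonMap("set", score));
            solr.add(update);
        }
        solr.commit();
        solr.close();
    }

    private static double computeScore(SolrDocument doc) {
        return 0.5; // stand-in for whatever the MapReduce analysis actually calculates
    }
}
```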
Re: solr over hdfs for accessing/ changing indexes outside solr
Dear Erick, I remembered some times ago, somebody asked about what is the point of modify Solr to use HDFS for storing indexes. As far as I remember somebody told him integrating Solr with HDFS has two advantages. 1) having hadoop replication and HA. 2) using indexes and Solr documents for other purposes such as Analysis. So why we go for HDFS in the case of analysis if we want to use SolrJ for this purpose? What is the point? Regards. On Wed, Aug 6, 2014 at 8:59 AM, Ali Nazemian wrote: > Dear Erick, > Hi, > Thank you for you reply. Yeah I am aware that SolrJ is my last option. I > was thinking about raw I/O operation. So according to your reply probably > it is not applicable somehow. What about the Lily project that Michael > mentioned? Is that consider SolrJ too? Are you aware of Cloudera search? I > know they provide an integrated Hadoop ecosystem. Do you know what is their > suggestion? > Best regards. > > > > On Wed, Aug 6, 2014 at 12:28 AM, Erick Erickson > wrote: > >> What you haven't told us is what you mean by "modify the >> index outside Solr". SolrJ? Using raw Lucene? Trying to modify >> things by writing your own codec? Standard Java I/O operations? >> Other? >> >> You could use SolrJ to connect to an existing Solr server and >> both read and modify at will form your M/R jobs. But if you're >> thinking of trying to write/modify the segment files by raw I/O >> operations, good luck! I'm 99.99% certain that's going to cause >> you endless grief. >> >> Best, >> Erick >> >> >> On Tue, Aug 5, 2014 at 9:55 AM, Ali Nazemian >> wrote: >> >> > Actually I am going to do some analysis on the solr data using map >> reduce. >> > For this purpose it might be needed to change some part of data or add >> new >> > fields from outside solr. >> > >> > >> > On Tue, Aug 5, 2014 at 5:51 PM, Shawn Heisey wrote: >> > >> > > On 8/5/2014 7:04 AM, Ali Nazemian wrote: >> > > > I changed solr 4.9 to write index and data on hdfs. Now I am going >> to >> > > > connect to those data from the outside of solr for changing some of >> the >> > > > values. Could somebody please tell me how that is possible? Suppose >> I >> > am >> > > > using Hbase over hdfs for do these changes. >> > > >> > > I don't know how you could safely modify the index without a Lucene >> > > application or another instance of Solr, but if you do manage to >> modify >> > > the index, simply reloading the core or restarting Solr should cause >> it >> > > to pick up the changes. Either you would need to make sure that Solr >> > > never modifies the index, or you would need some way of coordinating >> > > updates so that Solr and the other application would never try to >> modify >> > > the index at the same time. >> > > >> > > Thanks, >> > > Shawn >> > > >> > > >> > >> > >> > -- >> > A.Nazemian >> > >> > > > > -- > A.Nazemian > -- A.Nazemian
indexing comments with Apache Solr
Dear all, Hi, I was wondering how I can manage to index comments in Solr. Suppose I am going to index a web page that contains a news article plus some comments posted by people at the end of the page. How can I index those comments in Solr, considering that I want to run some analysis on them? For example, I want enough query flexibility to retrieve all comments posted between 24 June 2014 and 24 July 2014, or all comments posted by a specific person. Defining the comments as a multi-valued field is therefore not a solution, because that kind of query flexibility would not be feasible. So what is your suggestion about document granularity in this case? Can I treat each comment as a new document inside the main document (a tree-based structure)? What would you suggest here? I think this is a common case when indexing web pages these days, so I am probably not the only one thinking about it. Please share your thoughts and perhaps your experience with me. Thank you very much. Best regards. -- A.Nazemian
Re: indexing comments with Apache Solr
Dear Gora, I think you misunderstood my problem. I already use Nutch for crawling the websites; my problem is on the indexing side, not the crawling side. Suppose the page has been fetched and parsed by Nutch, and all the comments along with their dates and authors have been identified during parsing. What can I do to index these comments? What should the document granularity be? Best regards. On Wed, Aug 6, 2014 at 1:29 PM, Gora Mohanty wrote: > On 6 August 2014 14:13, Ali Nazemian wrote: > > > > Dear all, > > Hi, > > I was wondering how can I mange to index comments in solr? suppose I am > > going to index a web page that has a content of news and some comments > that > > are presented by people at the end of this page. How can I index these > > comments in solr? consider the fact that I am going to do some analysis > on > > these comments. For example I want to have such query flexibility for > > retrieving all comments that are presented between 24 June 2014 to 24 > July > > 2014! or all the comments that are presented by specific person. > Therefore > > defining these comment as multi-value field would not be the solution > since > > in this case such query flexibility is not feasible. So what is you > > suggestion about document granularity in this case? Can I consider all of > > these comments as a new document inside main document (tree based > > structure). What is your suggestion for this case? I think it is a common > > case of indexing webpages these days so probably I am not the only one > > thinking about this situation. Please share you though and perhaps your > > experiences in this condition with me. Thank you very much. > > Parsing a web page, and breaking up parts up for indexing into different > fields > is out of the scope of Solr. You might want to look at Apache Nutch which > can index into Solr, and/or other web crawlers/scrapers. > > Regards, > Gora > -- A.Nazemian
Re: indexing comments with Apache Solr
Dear Alexandre, Hi, Thank you very much. I think nested document is what I need. Do you have more information about how can I define such thing in solr schema? Your mentioned blog post was all about retrieving nested docs. Best regards. On Wed, Aug 6, 2014 at 5:16 PM, Alexandre Rafalovitch wrote: > You can index comments as child records. The structure of the Solr > document should be able to incorporate both parents and children > fields and you need to index them all together. Then, just search for > JOIN syntax for nested documents. Also, latest Solr (4.9) has some > extra functionality that allows you to find all parent pages and then > expand children pages to match. > > E.g.: http://heliosearch.org/expand-block-join/ seems relevant > > Regards, >Alex. > Personal: http://www.outerthoughts.com/ and @arafalov > Solr resources and newsletter: http://www.solr-start.com/ and @solrstart > Solr popularizers community: https://www.linkedin.com/groups?gid=6713853 > > > On Wed, Aug 6, 2014 at 11:18 AM, Ali Nazemian > wrote: > > Dear Gora, > > I think you misunderstood my problem. Actually I used nutch for crawling > > websites and my problem is in index side and not crawl side. Suppose page > > is fetch and parsed by Nutch and all comments and the date and source of > > comments are identified by parsing. Now what can I do for indexing these > > comments? What is the document granularity? > > Best regards. > > > > > > On Wed, Aug 6, 2014 at 1:29 PM, Gora Mohanty wrote: > > > >> On 6 August 2014 14:13, Ali Nazemian wrote: > >> > > >> > Dear all, > >> > Hi, > >> > I was wondering how can I mange to index comments in solr? suppose I > am > >> > going to index a web page that has a content of news and some comments > >> that > >> > are presented by people at the end of this page. How can I index these > >> > comments in solr? consider the fact that I am going to do some > analysis > >> on > >> > these comments. For example I want to have such query flexibility for > >> > retrieving all comments that are presented between 24 June 2014 to 24 > >> July > >> > 2014! or all the comments that are presented by specific person. > >> Therefore > >> > defining these comment as multi-value field would not be the solution > >> since > >> > in this case such query flexibility is not feasible. So what is you > >> > suggestion about document granularity in this case? Can I consider > all of > >> > these comments as a new document inside main document (tree based > >> > structure). What is your suggestion for this case? I think it is a > common > >> > case of indexing webpages these days so probably I am not the only one > >> > thinking about this situation. Please share you though and perhaps > your > >> > experiences in this condition with me. Thank you very much. > >> > >> Parsing a web page, and breaking up parts up for indexing into different > >> fields > >> is out of the scope of Solr. You might want to look at Apache Nutch > which > >> can index into Solr, and/or other web crawlers/scrapers. > >> > >> Regards, > >> Gora > >> > > > > > > > > -- > > A.Nazemian > -- A.Nazemian
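A short sketch of what the parent/child approach could look like with SolrJ and block join, using assumed field names (doc_type, comment_author, comment_date, comment_text). Schema-wise the main requirements are that these fields exist and that the schema contains the _root_ field (present in the default schemas); parent and children must be sent together in a single add call.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CommentsAsChildDocs {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/webpages").build();

        // Parent: the news page itself.
        SolrInputDocument page = new SolrInputDocument();
        page.addField("id", "http://example.com/news-1");
        page.addField("doc_type", "page");
        page.addField("content", "body of the news article");

        // Children: one document per comment, each with its own author and date.
        SolrInputDocument comment = new SolrInputDocument();
        comment.addField("id", "http://example.com/news-1#c1");
        comment.addField("doc_type", "comment");
        comment.addField("comment_author", "john");
        comment.addField("comment_date", "2014-07-01T10:00:00Z");
        comment.addField("comment_text", "nice article");
        page.addChildDocument(comment);

        solr.add(page);          // parent and children are indexed as one block
        solr.commit();

        // Pages that received a comment from "john" within the date range.
        SolrQuery pagesWithJohn = new SolrQuery(
            "{!parent which=\"doc_type:page\"}comment_author:john"
            + " AND comment_date:[2014-06-24T00:00:00Z TO 2014-07-24T23:59:59Z]");
        System.out.println(solr.query(pagesWithJohn).getResults().getNumFound());

        // The comments themselves, independent of their parent pages.
        SolrQuery commentsOnly = new SolrQuery(
            "doc_type:comment AND comment_date:[2014-06-24T00:00:00Z TO 2014-07-24T23:59:59Z]");
        System.out.println(solr.query(commentsOnly).getResults().getNumFound());

        solr.close();
    }
}
```

Because each comment is its own document, per-author and per-date-range queries work with ordinary field queries, and block join relates them back to the containing page when needed.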
Re: solr over hdfs for accessing/ changing indexes outside solr
Thank you very much. But why we should go for solr distributed with hadoop? There is already solrCloud which is pretty applicable in the case of big index. Is there any advantage for sending indexes over map reduce that solrCloud can not provide? Regards. On Wed, Aug 6, 2014 at 9:09 PM, Erick Erickson wrote: > bq: Are you aware of Cloudera search? I know they provide an integrated > Hadoop ecosystem. > > What Cloudera Search does via the MapReduceIndexerTool (MRIT) is create N > sub-indexes for > each shard in the M/R paradigm via EmbeddedSolrServer. Eventually, these > sub-indexes for > each shard are merged (perhaps through some number of levels) in the reduce > phase and > maybe merged into a live Solr instance (--go-live). You'll note that this > tool requires the > address of the ZK ensemble from which it can get the network topology, > configuration files, > all that rot. If you don't use the --go-live option, the output is still a > Solr index, it's just that > the index for each shard is left in a specific directory on HDFS. Being on > HDFS allows > this kind of M/R paradigm for massively parallel indexing operations, and > perhaps massively > complex analysis. > > Nowhere is there any low-level non-Solr manipulation of the indexes. > > The Flume fork just writes directly to the Solr nodes. It knows about the > ZooKeeper > ensemble and the collection too and communicates via SolrJ I'm pretty sure. > > As far as integrating with HDFS, you're right, HA is part of the package. > As far as using > the Solr indexes for analysis, well you can write anything you want to use > the Solr indexes > from anywhere in the M/R world and have them available from anywhere in the > cluster. There's > no real need to even have Solr running, you could use the output from MRIT > and access the > sub-shards with the EmbeddedSolrServer if you wanted, leaving out all the > pesky servlet > container stuff. > > bq: So why we go for HDFS in the case of analysis if we want to use SolrJ > for this purpose? > What is the point? > > Scale and data access in a nutshell. In the HDFS world, you can scale > pretty linearly > with the number of nodes you can rack together. > > Frankly though, if your data set is small enough to fit on a single machine > _and_ you can get > through your analysis in a reasonable time (reasonable here is up to you), > then HDFS > is probably not worth the hassle. But in the big data world where we're > talking petabyte scale, > having HDFS as the underpinning opens up possibilities for working on data > that were > difficult/impossible with Solr previously. > > Best, > Erick > > > > On Tue, Aug 5, 2014 at 9:37 PM, Ali Nazemian > wrote: > > > Dear Erick, > > I remembered some times ago, somebody asked about what is the point of > > modify Solr to use HDFS for storing indexes. As far as I remember > somebody > > told him integrating Solr with HDFS has two advantages. 1) having hadoop > > replication and HA. 2) using indexes and Solr documents for other > purposes > > such as Analysis. So why we go for HDFS in the case of analysis if we > want > > to use SolrJ for this purpose? What is the point? > > Regards. > > > > > > On Wed, Aug 6, 2014 at 8:59 AM, Ali Nazemian > > wrote: > > > > > Dear Erick, > > > Hi, > > > Thank you for you reply. Yeah I am aware that SolrJ is my last option. > I > > > was thinking about raw I/O operation. So according to your reply > probably > > > it is not applicable somehow. What about the Lily project that Michael > > > mentioned? Is that consider SolrJ too? 
Are you aware of Cloudera > search? > > I > > > know they provide an integrated Hadoop ecosystem. Do you know what is > > their > > > suggestion? > > > Best regards. > > > > > > > > > > > > On Wed, Aug 6, 2014 at 12:28 AM, Erick Erickson < > erickerick...@gmail.com > > > > > > wrote: > > > > > >> What you haven't told us is what you mean by "modify the > > >> index outside Solr". SolrJ? Using raw Lucene? Trying to modify > > >> things by writing your own codec? Standard Java I/O operations? > > >> Other? > > >> > > >> You could use SolrJ to connect to an existing Solr server and > > >> both read and modify at will from your M/R jobs. But if you're > > >> thinking of trying to write/modify the segment files by raw I/O > > >> operations, good luck! I'm 99.99% certain that's going to cause
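A rough sketch of the "access the sub-shards with the EmbeddedSolrServer" idea mentioned above, assuming SolrJ 4.x, a made-up local Solr home path, and a made-up core name (the MRIT-produced shard index would first have to be copied or mounted onto a local path):

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.core.CoreContainer;

public class ReadSubShardOffline {
    public static void main(String[] args) throws Exception {
        // Made-up Solr home that contains one of the MRIT-produced shard indexes.
        CoreContainer container = new CoreContainer("/data/mrit-output/solr-home");
        container.load();

        // The core name is an assumption; it must match a core defined under that Solr home.
        EmbeddedSolrServer server = new EmbeddedSolrServer(container, "collection1_shard1");
        try {
            QueryResponse rsp = server.query(new SolrQuery("*:*"));
            System.out.println("Docs in this sub-shard: " + rsp.getResults().getNumFound());
        } finally {
            server.shutdown(); // shuts down the embedded server and its cores
        }
    }
}
```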
Re: solr over hdfs for accessing/ changing indexes outside solr
Dear Erick, Could you please name those problems that SolrCloud can not tackle them alone? Maybe I need solrCloud+ Hadoop and I am not aware of that yet. Regards. On Thu, Aug 7, 2014 at 7:37 PM, Erick Erickson wrote: > If SolrCloud meets your needs, without Hadoop, then > there's no real reason to introduce the added complexity. > > There are a bunch of problems that do _not_ work > well with SolrCloud over non-Hadoop file systems. For > those problems, the combination of SolrCloud and Hadoop > make tackling them possible. > > Best, > Erick > > > On Thu, Aug 7, 2014 at 3:55 AM, Ali Nazemian > wrote: > > > Thank you very much. But why we should go for solr distributed with > hadoop? > > There is already solrCloud which is pretty applicable in the case of big > > index. Is there any advantage for sending indexes over map reduce that > > solrCloud can not provide? > > Regards. > > > > > > On Wed, Aug 6, 2014 at 9:09 PM, Erick Erickson > > wrote: > > > > > bq: Are you aware of Cloudera search? I know they provide an integrated > > > Hadoop ecosystem. > > > > > > What Cloudera Search does via the MapReduceIndexerTool (MRIT) is > create N > > > sub-indexes for > > > each shard in the M/R paradigm via EmbeddedSolrServer. Eventually, > these > > > sub-indexes for > > > each shard are merged (perhaps through some number of levels) in the > > reduce > > > phase and > > > maybe merged into a live Solr instance (--go-live). You'll note that > this > > > tool requires the > > > address of the ZK ensemble from which it can get the network topology, > > > configuration files, > > > all that rot. If you don't use the --go-live option, the output is > still > > a > > > Solr index, it's just that > > > the index for each shard is left in a specific directory on HDFS. Being > > on > > > HDFS allows > > > this kind of M/R paradigm for massively parallel indexing operations, > and > > > perhaps massively > > > complex analysis. > > > > > > Nowhere is there any low-level non-Solr manipulation of the indexes. > > > > > > The Flume fork just writes directly to the Solr nodes. It knows about > the > > > ZooKeeper > > > ensemble and the collection too and communicates via SolrJ I'm pretty > > sure. > > > > > > As far as integrating with HDFS, you're right, HA is part of the > package. > > > As far as using > > > the Solr indexes for analysis, well you can write anything you want to > > use > > > the Solr indexes > > > from anywhere in the M/R world and have them available from anywhere in > > the > > > cluster. There's > > > no real need to even have Solr running, you could use the output from > > MRIT > > > and access the > > > sub-shards with the EmbeddedSolrServer if you wanted, leaving out all > the > > > pesky servlet > > > container stuff. > > > > > > bq: So why we go for HDFS in the case of analysis if we want to use > SolrJ > > > for this purpose? > > > What is the point? > > > > > > Scale and data access in a nutshell. In the HDFS world, you can scale > > > pretty linearly > > > with the number of nodes you can rack together. > > > > > > Frankly though, if your data set is small enough to fit on a single > > machine > > > _and_ you can get > > > through your analysis in a reasonable time (reasonable here is up to > > you), > > > then HDFS > > > is probably not worth the hassle. But in the big data world where we're > > > talking petabyte scale, > > > having HDFS as the underpinning opens up possibilities for working on > > data > > > that were > > > difficult/impossible with Solr previously. 
> > > > > > Best, > > > Erick > > > > > > > > > > > > On Tue, Aug 5, 2014 at 9:37 PM, Ali Nazemian > > > wrote: > > > > > > > Dear Erick, > > > > I remembered some times ago, somebody asked about what is the point > of > > > > modify Solr to use HDFS for storing indexes. As far as I remember > > > somebody > > > > told him integrating Solr with HDFS has two advantages. 1) having > > hadoop > > > > replication and HA. 2) using indexes and Solr documents for other > > > purposes > > > > such as Analysis. So
Send nested doc with solrJ
Dear all, Hi, I was wondering how I can use SolrJ for sending nested documents to Solr? Unfortunately I did not find any tutorial for this purpose. I would really appreciate it if you could guide me through that. Thank you very much. Best regards. -- A.Nazemian
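A minimal SolrJ sketch of one way to send a parent document with a child document, assuming a SolrJ 4.x client and made-up URL, core, and field names:

```java
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class NestedDocExample {
    public static void main(String[] args) throws Exception {
        // Made-up URL and core name.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrInputDocument parent = new SolrInputDocument();
        parent.addField("id", "product01");
        parent.addField("type", "product");

        SolrInputDocument child = new SolrInputDocument();
        child.addField("id", "part01");
        child.addField("type", "part");

        // Attach the child so the whole parent/child block is indexed together.
        parent.addChildDocument(child);

        solr.add(parent);
        solr.commit();
        solr.shutdown();
    }
}
```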
boosting words from specific list
Dear all, Hi, I was wondering how I can implement boosting of words from a specific list of important words in Solr? I mean I want to have a list of important words and tell Solr to score documents based on the weighted sum of these words. For example, let the word "school" have a weight of 2 and the word "president" a weight of 5. In this case a doc with 2 "school" words and 3 "president" words will have a total score of 2*2 + 3*5 = 19. I want to sort documents based on this score. How is such a procedure possible in Solr? Thank you very much. Best regards. -- A.Nazemian
solrJ bug related to solrJ 4.10 for having both incremental partial update and child document on the same solr document!
Dear all, Hi, Right now I am facing a strange problem with the SolrJ client: when I use only an incremental partial update, the partial update works fine. When I use only adding child documents, it works perfectly and the child documents are added successfully. But when I have both of them in the same SolrInputDocument, adding the child documents does not work. I think the solr.add(document) method cannot handle having both an incremental partial update and a child document in the same Solr document, so it is probably a bug in SolrJ. Would you please consider this situation? Thank you very much. Best regards. -- A.Nazemian
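For clarity, a rough sketch of the combination described above, with made-up URL, core, and field names: an atomic "set" update and an added child document in the same SolrInputDocument, sent through a single solr.add(document) call:

```java
import java.util.Collections;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class PartialUpdatePlusChild {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "12345");
        // Incremental (atomic) partial update of one field.
        doc.addField("read_flag", Collections.singletonMap("set", "true"));

        // Child document added to the very same parent document.
        SolrInputDocument child = new SolrInputDocument();
        child.addField("id", "12345-child-1");
        child.addField("type", "child");
        doc.addChildDocument(child);

        // This is the call that, in the scenario described above, silently drops the child.
        solr.add(doc);
        solr.commit();
        solr.shutdown();
    }
}
```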
Re: solrJ bug related to solrJ 4.10 for having both incremental partial update and child document on the same solr document!
I also checked both the Solr log and the Solr console. There is no error there; it seems that everything is fine! But there is not any child document after the process executes. On Mon, Sep 29, 2014 at 1:47 PM, Ali Nazemian wrote: > Dear all, > Hi, > Right now I face with the strange problem related to solJ client: > When I use only incremental partial update. The incremental partial update > works fine. When I use only the add child documents. It works perfectly and > the child documents added successfully. But when I have both of them in > solrInputDocument the adding child documents did not work. I think that the > solr.add(document) method can not work when you have both incremental > partial update and child document in your solr document. So probably it is > a bug related to solj. Would you please consider this situation? > Thank you very much. > Best regards. > > -- > A.Nazemian > -- A.Nazemian
Re: boosting words from specific list
Dear Koji, Hi, Thank you very much. Do you know any example code for UpdateRequestProcessor? Anything would be appreciated. Best regards. On Tue, Sep 30, 2014 at 3:41 AM, Koji Sekiguchi wrote: > Hi Ali, > > I don't think Solr has such function OOTB. One way I can think of is that > you can implement UpdateRequestProcessor. In processAdd() method of > the UpdateRequestProcessor, as you can read field values, you can calculate > the total score and copy the total score to a field e.g. total_score. > Then you can sort the query result on total_score field when you query. > > Koji > -- > http://soleami.com/blog/comparing-document-classification-functions-of- > lucene-and-mahout.html > > > (2014/09/29 4:25), Ali Nazemian wrote: > >> Dear all, >> Hi, >> I was wondering how can I implement solr boosting words from specific list >> of important words? I mean I want to have a list of important words and >> tell solr to score documents based on the weighted sum of these words. For >> example let word "school" has weight of 2 and word "president" has the >> weight of 5. In this case a doc with 2 "school" words and 3 "president" >> words will has the total score of 19! I want to sort documents based on >> this score. How such procedure is possible in solr? Thank you very much. >> Best regards. >> >> > > > -- A.Nazemian
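Not official example code, but a rough sketch along the lines Koji describes, with made-up weights, a made-up "content" source field, and a "total_score" target field. The factory would still have to be wired into an update chain in solrconfig.xml, and a real implementation would probably reuse the field's analyzer instead of the naive whitespace split shown here:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class WeightedScoreUpdateProcessorFactory extends UpdateRequestProcessorFactory {

    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                              SolrQueryResponse rsp,
                                              UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            // Hard-coded example weights; in practice they could come from init args or a file.
            private final Map<String, Integer> weights = new HashMap<String, Integer>() {{
                put("school", 2);
                put("president", 5);
            }};

            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                SolrInputDocument doc = cmd.getSolrInputDocument();
                Object content = doc.getFieldValue("content");
                long total = 0;
                if (content != null) {
                    // Naive whitespace tokenization just for illustration.
                    for (String token : content.toString().toLowerCase().split("\\s+")) {
                        Integer w = weights.get(token);
                        if (w != null) {
                            total += w;
                        }
                    }
                }
                doc.setField("total_score", total);
                super.processAdd(cmd); // hand the document to the rest of the chain
            }
        };
    }
}
```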
Re: boosting words from specific list
Dear Koji, Also, would you please tell me how I can access the term frequency for each word? Should I do a word count on the content, or is it possible to access the inverted-index information to make the process more efficient? I don't want to add too much overhead to the indexing time. On Tue, Sep 30, 2014 at 7:07 PM, Ali Nazemian wrote: > Dear Koji, > Hi, > Thank you very much. > Do you know any example code for UpdateRequestProcessor? Anything would be > appreciated. > Best regards. > > On Tue, Sep 30, 2014 at 3:41 AM, Koji Sekiguchi > wrote: > >> Hi Ali, >> >> I don't think Solr has such function OOTB. One way I can think of is that >> you can implement UpdateRequestProcessor. In processAdd() method of >> the UpdateRequestProcessor, as you can read field values, you can >> calculate >> the total score and copy the total score to a field e.g. total_score. >> Then you can sort the query result on total_score field when you query. >> >> Koji >> -- >> http://soleami.com/blog/comparing-document-classification-functions-of- >> lucene-and-mahout.html >> >> >> (2014/09/29 4:25), Ali Nazemian wrote: >> >>> Dear all, >>> Hi, >>> I was wondering how can I implement solr boosting words from specific >>> list >>> of important words? I mean I want to have a list of important words and >>> tell solr to score documents based on the weighted sum of these words. >>> For >>> example let word "school" has weight of 2 and word "president" has the >>> weight of 5. In this case a doc with 2 "school" words and 3 "president" >>> words will has the total score of 19! I want to sort documents based on >>> this score. How such procedure is possible in solr? Thank you very much. >>> Best regards. >>> >>> >> >> >> > > > -- > A.Nazemian > -- A.Nazemian
Re: solrJ bug related to solrJ 4.10 for having both incremental partial update and child document on the same solr document!
Did anybody test that? Best regards. On Mon, Sep 29, 2014 at 2:05 PM, Ali Nazemian wrote: > I also check both solr log and solr console. There is no error inside > that, it seems that every thing is fine! But actually there is not any > child document after executing process. > > > On Mon, Sep 29, 2014 at 1:47 PM, Ali Nazemian > wrote: > >> Dear all, >> Hi, >> Right now I face with the strange problem related to solJ client: >> When I use only incremental partial update. The incremental partial >> update works fine. When I use only the add child documents. It works >> perfectly and the child documents added successfully. But when I have both >> of them in solrInputDocument the adding child documents did not work. I >> think that the solr.add(document) method can not work when you have both >> incremental partial update and child document in your solr document. So >> probably it is a bug related to solj. Would you please consider this >> situation? >> Thank you very much. >> Best regards. >> >> -- >> A.Nazemian >> > > > > -- > A.Nazemian > -- A.Nazemian
duplicate unique key after partial update in solr 4.10
Dear all, Hi, I am doing a partial update on a field that does not have any value yet. Suppose I have a document with document id (unique key) '12345' and a field "read_flag" which was not indexed in the first place, so the read_flag field for this document has no value. After I did a partial update on this document to set "read_flag"="true", I faced a strange problem: the next time I indexed the same document with the same values, I saw two different versions of the document with id '12345' in Solr, one of them with read_flag=true and another one without the read_flag field! I do not want duplicate documents (and there should not be any, because of the unique key id). Would you please tell me what causes such a problem? Best regards. -- A.Nazemian
Re: duplicate unique key after partial update in solr 4.10
Dear Alex, Hi, LOL, yeah I am sure. You can test it yourself. I did that on default schema too. The results are same! Regards. On Mon, Oct 6, 2014 at 4:20 PM, Alexandre Rafalovitch wrote: > A stupid question: Are you sure that what schema thinks your uniqueId > is - is the uniqueId in your setup? Also, that you are not somehow > using the flags to tell Solr to ignore duplicates? > > Regards, >Alex. > Personal: http://www.outerthoughts.com/ and @arafalov > Solr resources and newsletter: http://www.solr-start.com/ and @solrstart > Solr popularizers community: https://www.linkedin.com/groups?gid=6713853 > > > On 6 October 2014 03:40, Ali Nazemian wrote: > > Dear all, > > Hi, > > I am going to do partial update on a field that has not any value. > Suppose > > I have a document with document id (unique key) '12345' and field > > "read_flag" which does not index at the first place. So the read_flag > field > > for this document has not any value. After I did partial update to this > > document to set "read_flag"="true", I faced strange problem. Next time I > > indexed same document with same values I saw two different version of > > document with id '12345' in solr. One of them with read_flag=true and > > another one without read_flag field! I dont want to have duplicate > > documents (as it should not to be because of unique_key id). Would you > > please tell me what caused such problem? > > Best regards. > > > > -- > > A.Nazemian > -- A.Nazemian
Re: duplicate unique key after partial update in solr 4.10
The list of docs before doing the partial update:
product01 | car | product
part01 | wheels | part
part02 | engine | part
part03 | brakes | part
product02 | truck | product
part04 | wheels | part
part05 | flaps | part
The list of docs after doing the partial update of field read_flag for document "product01":
product01 | car | product | read_flag=true
part01 | wheels | part
part02 | engine | part
part03 | brakes | part
product02 | truck | product
part04 | wheels | part
part05 | flaps | part
The list of documents after sending the same documents again (it should overwrite the last one because of the duplicate IDs):
product01 | car | product | read_flag=true
product01 | car | product
part01 | wheels | part
part02 | engine | part
part03 | brakes | part
product02 | truck | product
part04 | wheels | part
part05 | flaps | part
But as you can see there are two different versions of the document with the same ID (which is product01). Regards. On Mon, Oct 6, 2014 at 8:18 PM, Alexandre Rafalovitch wrote: > Can you upload the update documents then (into a Gist or similar). > Just so that people didn't have to re-imagine exact steps. Because, if > it fully checks out, it might be a bug and the next step would be > creating a JIRA ticket. > > Regards, >Alex. > Personal: http://www.outerthoughts.com/ and @arafalov > Solr resources and newsletter: http://www.solr-start.com/ and @solrstart > Solr popularizers community: https://www.linkedin.com/groups?gid=6713853 > > > On 6 October 2014 11:23, Ali Nazemian wrote: > > Dear Alex, > > Hi, > > LOL, yeah I am sure. You can test it yourself. I did that on default > schema > > too. The results are same! > > Regards. > > > > On Mon, Oct 6, 2014 at 4:20 PM, Alexandre Rafalovitch < > arafa...@gmail.com> > > wrote: > > > >> A stupid question: Are you sure that what schema thinks your uniqueId > >> is - is the uniqueId in your setup? Also, that you are not somehow > >> using the flags to tell Solr to ignore duplicates? > >> > >> Regards, > >>Alex. > >> Personal: http://www.outerthoughts.com/ and @arafalov > >> Solr resources and newsletter: http://www.solr-start.com/ and > @solrstart > >> Solr popularizers community: > https://www.linkedin.com/groups?gid=6713853 > >> > >> > >> On 6 October 2014 03:40, Ali Nazemian wrote: > >> > Dear all, > >> > Hi, > >> > I am going to do partial update on a field that has not any value. > >> Suppose > >> > I have a document with document id (unique key) '12345' and field > >> > "read_flag" which does not index at the first place. So the read_flag > >> field > >> > for this document has not any value. After I did partial update to > this > >> > document to set "read_flag"="true", I faced strange problem. Next > time I > >> > indexed same document with same values I saw two different version of > >> > document with id '12345' in solr. One of them with read_flag=true and > >> > another one without read_flag field! I dont want to have duplicate > >> > documents (as it should not to be because of unique_key id). Would you > >> > please tell me what caused such problem? > >> > Best regards. > >> > > >> > -- > >> > A.Nazemian > >> > > > > > > > > -- > > A.Nazemian > -- A.Nazemian
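A rough SolrJ sketch of the three steps above, under the assumption that the product/part documents are indexed as parent/child blocks (ids and field values taken from the listing; the helper methods and URL are made up for brevity):

```java
import java.util.Collections;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DuplicateIdRepro {
    static SolrInputDocument doc(String id, String name, String type) {
        SolrInputDocument d = new SolrInputDocument();
        d.addField("id", id);
        d.addField("name", name);
        d.addField("type", type);
        return d;
    }

    static SolrInputDocument productBlock() {
        SolrInputDocument product = doc("product01", "car", "product");
        product.addChildDocument(doc("part01", "wheels", "part"));
        product.addChildDocument(doc("part02", "engine", "part"));
        product.addChildDocument(doc("part03", "brakes", "part"));
        return product;
    }

    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Step 1: index the parent with its children as one block.
        solr.add(productBlock());
        solr.commit();

        // Step 2: atomic (partial) update of read_flag on the parent only.
        SolrInputDocument update = new SolrInputDocument();
        update.addField("id", "product01");
        update.addField("read_flag", Collections.singletonMap("set", "true"));
        solr.add(update);
        solr.commit();

        // Step 3: re-send the same block; in the report above, a second product01
        // (without read_flag) then shows up next to the updated one instead of overwriting it.
        solr.add(productBlock());
        solr.commit();
        solr.shutdown();
    }
}
```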
import solr source to eclipse
Hi, I am going to import the Solr source code into Eclipse for some development purposes. Unfortunately, every tutorial I found for this purpose is outdated and did not work. So would you please give me some hints on how I can import the Solr source code into Eclipse? Thank you very much. -- A.Nazemian
Re: import solr source to eclipse
Thank you very much for your guidance, but how can I run the Solr server inside Eclipse? Best regards. On Mon, Oct 13, 2014 at 8:02 PM, Rajani Maski wrote: > Hi, > > The best tutorial for setting up Solr[solr 4.7] in eclipse/intellij is > documented in Solr In Action book, Appendix A, *Working with the Solr > codebase* > > > On Mon, Oct 13, 2014 at 6:45 AM, Tomás Fernández Löbbe < > tomasflo...@gmail.com> wrote: > > > The way I do this: > > From a terminal: > > svn checkout https://svn.apache.org/repos/asf/lucene/dev/trunk/ > > lucene-solr-trunk > > cd lucene-solr-trunk > > ant eclipse > > > > ... And then, from your Eclipse "import existing java project", and > select > > the directory where you placed lucene-solr-trunk > > > > On Sun, Oct 12, 2014 at 7:09 AM, Ali Nazemian > > wrote: > > > > > Hi, > > > I am going to import solr source code to eclipse for some development > > > purpose. Unfortunately every tutorial that I found for this purpose is > > > outdated and did not work. So would you please give me some hint about > > how > > > can I import solr source code to eclipse? > > > Thank you very much. > > > > > > -- > > > A.Nazemian > > > > > > -- A.Nazemian
mark solr documents as duplicates on hashing the combination of some fields
Dear all, Hi, I was wondering how I can mark some documents as duplicates (just marking them for future use, not deleting them) based on a hash of the combination of some fields? Suppose I have two fields named "url" and "title"; I want to create a hash based on url+title and send it to another field named "signature". If I do that using Solr dedup, it results in deleting the duplicate documents! So it is not applicable to my situation. Thank you very much. Best regards. -- A.Nazemian
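One possible client-side workaround sketch (not Solr's dedup handler): compute a hash over url+title in the SolrJ client and write it into a separate "signature" field, so nothing is ever deleted. The MD5 choice, URL, and field names are just assumptions:

```java
import java.math.BigInteger;
import java.security.MessageDigest;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ClientSideSignature {
    static String md5(String text) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(text.getBytes("UTF-8"));
        return String.format("%032x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("url", "http://example.com/a");
        doc.addField("title", "Example title");
        // Documents sharing the same signature can later be found by faceting or grouping on it.
        doc.addField("signature", md5(doc.getFieldValue("url") + "|" + doc.getFieldValue("title")));

        solr.add(doc);
        solr.commit();
        solr.shutdown();
    }
}
```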
having Solr deduplication and partial update
Hi, I was wondering how I can have both Solr deduplication and partial updates. I found out that, for some reason, you cannot rely on Solr deduplication when you try to update a document partially! It seems that when you do a partial update on some field, even if that field is not one of the signature fields, the Solr signature created by deduplication becomes invalid! Is there any way I can have both deduplication and partial updates? Thank you very much. -- A.Nazemian
Re: mark solr documents as duplicates on hashing the combination of some fields
The problem is that when I partially update some fields of a document, the signature becomes useless, even if the updated fields are not included in the signatureField! Regards. On Wed, Oct 22, 2014 at 12:44 AM, Chris Hostetter wrote: > > you can still use the SignatureUpdateProcessorFactory for your usecase, > just don't configure the signatureField to be the same as your uniqueKey > field. > > configure some other fieldname (ie "signature") instead. > > > : Date: Tue, 14 Oct 2014 12:08:26 +0330 > : From: Ali Nazemian > : Reply-To: solr-user@lucene.apache.org > : To: "solr-user@lucene.apache.org" > : Subject: mark solr documents as duplicates on hashing the combination of some > : fields > : > : Dear all, > : Hi, > : I was wondering how can I mark some documents as duplicate (just marking > : for future usage not deleting) based on the hash combination of some > : fields? Suppose I have 2 fields name "url" and "title" I want to create > : hash based on url+title and send it to another field name "signature". > If I > : do that using solr dedup, it will be resulted to deleting duplicate > : documents! So it is not applicable for my situation. Thank you very much. > : Best regards. > : > : -- > : A.Nazemian > : > > -Hoss > http://www.lucidworks.com/
Re: mark solr documents as duplicates on hashing the combination of some fields
I meant the signature will be broken. For example, suppose the destination field of the hash function for the signature fields is "sig"; after each partial update it becomes "00"! On Wed, Oct 22, 2014 at 2:59 PM, Alexandre Rafalovitch wrote: > What do you mean by 'useless' specifically on the business level? > > Regards, > Alex > On 22/10/2014 7:27 am, "Ali Nazemian" wrote: > > > The problem is when I partially update some fields of document. The > > signature becomes useless! Even if the updated fields are not included in > > the signatureField! > > Regards. > > > > On Wed, Oct 22, 2014 at 12:44 AM, Chris Hostetter < > > hossman_luc...@fucit.org> > > wrote: > > > > > > > > you can still use the SignatureUpdateProcessorFactory for your usecase, > > > just don't configure teh signatureField to be the same as your > uniqueKey > > > field. > > > > > > configure some othe fieldname (ie "signature") instead. > > > > > > > > > : Date: Tue, 14 Oct 2014 12:08:26 +0330 > > > : From: Ali Nazemian > > > : Reply-To: solr-user@lucene.apache.org > > > : To: "solr-user@lucene.apache.org" > > > : Subject: mark solr documents as duplicates on hashing the combination > > of > > > some > > > : fields > > > : > > > : Dear all, > > > : Hi, > > > : I was wondering how can I mark some documents as duplicate (just > > marking > > > : for future usage not deleting) based on the hash combination of some > > > : fields? Suppose I have 2 fields name "url" and "title" I want to > create > > > : hash based on url+title and send it to another field name > "signature". > > > If I > > > : do that using solr dedup, it will be resulted to deleting duplicate > > > : documents! So it is not applicable for my situation. Thank you very > > much. > > > : Best regards. > > > : > > > : -- > > > : A.Nazemian > > > : > > > > > > -Hoss > > > http://www.lucidworks.com/ > > > > > > > > > > > -- > > A.Nazemian > > > -- A.Nazemian
Hardware requirement for 500 million documents
Hi, I was wondering what the hardware requirements are for indexing 500 million documents in Solr? Suppose the maximum number of concurrent users at peak time would be 20. Thank you very much. -- A.Nazemian
Extending solr analysis in index time
Hi everybody, I am going to add some analysis to Solr at index time. Here is what I have in mind: suppose I have two different fields in the Solr schema, field "a" and field "b". I am going to use the inverted index in a way that some terms are considered important ones, and tell Lucene to calculate a value based on those terms' frequencies per document. For example, let the word "hello" be considered an important word with a weight of 2.0, and suppose the term frequency for this word in field "a" is 3 and in field "b" is 6 for document 1. Therefore the score value would be 2*3+(2*6)^2. I want to calculate this score based on these fields and put it in the index for retrieval. My question is how I can do such a thing? At first I considered using the terms component to calculate this value from outside and put it back into the Solr index, but it does not seem efficient enough. Thank you very much. Best regards. -- A.Nazemian
Re: Extending solr analysis in index time
Dear Jack, Hi, I think you misunderstood my need. I don't want to change the default scoring behavior of Lucene (tf-idf); I just want to have another field to sort on for some specific queries (not for all the search business). However, I am aware of Lucene payloads. Thank you very much. On Sun, Jan 11, 2015 at 7:15 PM, Jack Krupansky wrote: > You would do that with a custom similarity (scoring) class. That's an > expert feature. In fact a SUPER-expert feature. > > Start by completely familiarizing yourself with how TF*IDF similarity > already works: > > http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html > > And to use your custom similarity class in Solr: > > https://cwiki.apache.org/confluence/display/solr/Other+Schema+Elements#OtherSchemaElements-Similarity > > > -- Jack Krupansky > > On Sun, Jan 11, 2015 at 9:04 AM, Ali Nazemian > wrote: > > > Hi everybody, > > > > I am going to add some analysis to Solr at the index time. Here is what I > > am considering in my mind: > > Suppose I have two different fields for Solr schema, field "a" and field > > "b". I am going to use the created reverse index in a way that some terms > > are considered as important ones and tell lucene to calculate a value > based > > on these terms frequency per each document. For example let the word > > "hello" considered as important word with the weight of "2.0". Suppose > the > > term frequency for this word at field "a" is 3 and at field "b" is 6 for > > document 1. Therefor the score value would be 2*3+(2*6)^2. I want to > > calculate this score based on these fields and put it in the index for > > retrieving. My question would be how can I do such thing? First I did > > consider using term component for calculating this value from outside and > > put it back to Solr index, but it seems it is not efficient enough. > > > > Thank you very much. > > Best regards. > > > > -- > > A.Nazemian > > > -- A.Nazemian
Re: Extending solr analysis in index time
Dear Alexandre, I have not tried UpdateRequestProcessor yet. Can I access term frequencies at that level? I don't want to calculate term frequencies once more when Lucene already calculates them in the inverted index. Thank you very much. On Jan 11, 2015 7:49 PM, "Alexandre Rafalovitch" wrote: > Your description uses the terms Solr/Lucene uses but perhaps not in > the same way we do. That might explain the confusion. > > It sounds - on a high level - that you want to create a field based on > a combination of a couple of other fields during indexing stage. Have > you tried UpdateRequestProcessors? They have access to the full > document when it is sent and can do whatever they want with it. > > Regards, >Alex. > > Sign up for my Solr resources newsletter at http://www.solr-start.com/ > > > On 11 January 2015 at 10:55, Ali Nazemian wrote: > > Dear Jack, > > Hi, > > I think you misunderstood my need. I dont want to change the default > > scoring behavior of Lucene (tf-idf) I just want to have another field to > do > > sorting for some specific queries (not all the search business), however > I > > am aware of Lucene payload. > > Thank you very much. > > > > On Sun, Jan 11, 2015 at 7:15 PM, Jack Krupansky < > jack.krupan...@gmail.com> > > wrote: > > > >> You would do that with a custom similarity (scoring) class. That's an > >> expert feature. In fact a SUPER-expert feature. > >> > >> Start by completely familiarizing yourself with how TF*IDF similarity > >> already works: > >> > >> > http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html > >> > >> And to use your custom similarity class in Solr: > >> > >> > https://cwiki.apache.org/confluence/display/solr/Other+Schema+Elements#OtherSchemaElements-Similarity > >> > >> > >> -- Jack Krupansky > >> > >> On Sun, Jan 11, 2015 at 9:04 AM, Ali Nazemian > >> wrote: > >> > >> > Hi everybody, > >> > > >> > I am going to add some analysis to Solr at the index time. Here is > what I > >> > am considering in my mind: > >> > Suppose I have two different fields for Solr schema, field "a" and > field > >> > "b". I am going to use the created reverse index in a way that some > terms > >> > are considered as important ones and tell lucene to calculate a value > >> based > >> > on these terms frequency per each document. For example let the word > >> > "hello" considered as important word with the weight of "2.0". Suppose > >> the > >> > term frequency for this word at field "a" is 3 and at field "b" is 6 > for > >> > document 1. Therefor the score value would be 2*3+(2*6)^2. I want to > >> > calculate this score based on these fields and put it in the index for > >> > retrieving. My question would be how can I do such thing? First I did > >> > consider using term component for calculating this value from outside > and > >> > put it back to Solr index, but it seems it is not efficient enough. > >> > > >> > Thank you very much. > >> > Best regards. > >> > > >> > -- > >> > A.Nazemian > >> > > >> > > > > > > > > -- > > A.Nazemian >
Re: Extending solr analysis in index time
Dear Jack, Thank you very much. Yes, I was thinking of a function query for sorting, but I have two problems in this case: 1) a function query does the computation at query time, which I want to avoid; 2) I also want the score field to be retrievable so I can show it to users. Dear Alexandre, Here is some more explanation of the business behind the question: I am going to provide a field for each document, let's refer to it as "document_score". I am going to fill this field based on information that can be extracted from the Lucene inverted index. Assume I have a list of terms, called important terms, and I am going to extract the term frequency of each term in this list per document. I want to use those term frequencies for calculating "document_score". "document_score" should be storable since I am going to retrieve this field for each document. I also want to be able to sort on "document_score" when the user prefers it. I hope I have conveyed my point. Best regards. On Mon, Jan 12, 2015 at 12:53 AM, Jack Krupansky wrote: > Won't function queries do the job at query time? You can add or multiply > the tf*idf score by a function of the term frequency of arbitrary terms, > using the tf, mul, and add functions. > > See: > https://cwiki.apache.org/confluence/display/solr/Function+Queries > > -- Jack Krupansky > > On Sun, Jan 11, 2015 at 10:55 AM, Ali Nazemian > wrote: > > > Dear Jack, > > Hi, > > I think you misunderstood my need. I dont want to change the default > > scoring behavior of Lucene (tf-idf) I just want to have another field to > do > > sorting for some specific queries (not all the search business), however > I > > am aware of Lucene payload. > > Thank you very much. > > > > On Sun, Jan 11, 2015 at 7:15 PM, Jack Krupansky < > jack.krupan...@gmail.com> > > wrote: > > > > > You would do that with a custom similarity (scoring) class. That's an > > > expert feature. In fact a SUPER-expert feature. > > > > > > Start by completely familiarizing yourself with how TF*IDF similarity > > > already works: > > > > > > > > > http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html > > > > > > And to use your custom similarity class in Solr: > > > > > > > > > https://cwiki.apache.org/confluence/display/solr/Other+Schema+Elements#OtherSchemaElements-Similarity > > > > > > > > > -- Jack Krupansky > > > > > > On Sun, Jan 11, 2015 at 9:04 AM, Ali Nazemian > > > wrote: > > > > > > > Hi everybody, > > > > > > > > I am going to add some analysis to Solr at the index time. Here is > > what I > > > > am considering in my mind: > > > > Suppose I have two different fields for Solr schema, field "a" and > > field > > > > "b". I am going to use the created reverse index in a way that some > > terms > > > > are considered as important ones and tell lucene to calculate a value > > > based > > > > on these terms frequency per each document. For example let the word > > > > "hello" considered as important word with the weight of "2.0". > Suppose > > > the > > > > term frequency for this word at field "a" is 3 and at field "b" is 6 > > for > > > > document 1. Therefor the score value would be 2*3+(2*6)^2. I want to > > > > calculate this score based on these fields and put it in the index > for > > > > retrieving. My question would be how can I do such thing? First I did > > > > consider using term component for calculating this value from outside > > and > > > > put it back to Solr index, but it seems it is not efficient enough. 
> > > > > > > > Thank you very much. > > > > Best regards. > > > > > > > > -- > > > > A.Nazemian > > > > > > > > > > > > > > > -- > > A.Nazemian > > > -- A.Nazemian
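For the record, a sketch of what Jack's function-query suggestion could look like from SolrJ for the 2*3+(2*6)^2 example in this thread, using the built-in tf, mul, sum, and pow functions. The field names "a" and "b" and the term "hello" are the made-up ones from the thread, and note that this computes the value per query rather than storing it in the index:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class FunctionQuerySort {
    public static void main(String[] args) throws Exception {
        // Made-up URL and core name.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // weight*tf(a,hello) + (weight*tf(b,hello))^2 with weight = 2,
        // mirroring the 2*3+(2*6)^2 example from the thread.
        String score = "sum(mul(2,tf(a,'hello')),pow(mul(2,tf(b,'hello')),2))";

        SolrQuery q = new SolrQuery("*:*");
        q.setSort(score, SolrQuery.ORDER.desc);   // sort by the computed value
        q.setFields("id", "docScore:" + score);   // also return it as a pseudo-field
        q.setRows(10);

        QueryResponse rsp = solr.query(q);
        for (SolrDocument d : rsp.getResults()) {
            System.out.println(d.getFieldValue("id") + " -> " + d.getFieldValue("docScore"));
        }
        solr.shutdown();
    }
}
```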