Solr JSON facet range out of memory exception
Dear all Solr users/developers, Hi, I am going to use a Solr JSON facet range on a date field which is stored as long millis. Unfortunately I get a Java heap space exception no matter how much memory is assigned to the Solr Java heap! I already tested that with 2g of heap space for a Solr core with 50k documents!! Is there any performance concern regarding a JSON facet range on a long field? Sincerely, -- A.Nazemian
Re: Solr JSON facet range out of memory exception
Dear Yonik, Hi, The entire index has 50k documents, not just the faceted subset. It is just a test case right now! I used the JSON facet API; here is my query after encoding: http://10.102.1.5:8983/solr/edgeIndex/select?q=*%3A*&fq=stat_owner_id:122952&rows=0&wt=json&indent=true&facet=true&json.facet=%7bresult:%7btype:range,field:stat_date,start:146027158386,end:1460271583864,gap:1%7d%7d Sincerely, On Sun, Apr 10, 2016 at 4:56 PM, Yonik Seeley wrote: > On Sun, Apr 10, 2016 at 3:47 AM, Ali Nazemian > wrote: > > Dear all Solr users/developeres, > > Hi, > > I am going to use Solr JSON facet range on a date filed which is stored > as > > long milis. Unfortunately I got java heap space exception no matter how > > much memory assigned to Solr Java heap! I already test that with 2g heap > > space for Solr core with 50k documents!! > > You mean the entire index is 50K documents? Or do you mean the subset > of documents to be faceted? > If you're getting an OOM with the former (with a 2G heap), it sounds > like you've hit some sort of bug. > > What does your faceting command look like? > > -Yonik > -- A.Nazemian
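(For readability, the encoded json.facet parameter above decodes to roughly the following; the content is reconstructed from the URL with whitespace added:)

json.facet = {
  result: {
    type: range,
    field: stat_date,
    start: 146027158386,
    end: 1460271583864,
    gap: 1
  }
}

With gap:1 over a span of 1460271583864 - 146027158386, which is roughly 1.3 x 10^12 milliseconds, this single request describes on the order of a trillion range buckets, which could plausibly account for heap exhaustion on its own, independent of the 50k document count.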
Solr facet using gap function
Dear all, Hi, I am wondering, is there any way to introduce and add a custom function for the facet gap parameter? I already know there is some date math that can be used (such as DAY, MONTH, etc.). I want to add some functions of my own and use them as the gap in a facet range; is it possible? Sincerely, Ali.
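(As a point of reference, a minimal SolrJ sketch of the built-in date-math gap the question refers to might look like the following; the field name stat_date and the date range are illustrative assumptions, not taken from the message:)

import java.util.Date;
import org.apache.solr.client.solrj.SolrQuery;

public class DateGapFacetSketch {
    public static SolrQuery build() {
        SolrQuery query = new SolrQuery("*:*");
        query.setRows(0);
        // facet.range.gap accepts built-in date math such as +1DAY or +1MONTH;
        // the question above asks whether a custom function could be plugged in here instead.
        query.addDateRangeFacet("stat_date", new Date(0L), new Date(), "+1DAY");
        return query;
    }
}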
Searching for term sequence including blank character using regex
Dear Solr Users/Developers, Hi, I was wondering what the correct query syntax is for searching a sequence of terms with a blank character in the middle of the sequence. Suppose I am looking for a query syntax using the fq parameter. For example, suppose I want to search for all documents containing the "hello world" sequence using the fq parameter. I am not sure why fq=content:/.*hello world.*/ did not work for a tokenized field in this situation, while fq=content:/.*hello.*/ did work for the same field. Is there any fq query syntax for such a search requirement? Best regards. -- A.Nazemian
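(No reply appears in this archive; as an aside that is not from the thread: Lucene/Solr regular-expression queries are matched against individual indexed terms, so on a tokenized field a pattern containing a space such as /.*hello world.*/ cannot match, because "hello" and "world" are separate tokens, while /.*hello.*/ can match the single token "hello". A phrase filter is the usual way to require the two adjacent tokens; a minimal SolrJ sketch, using the field name from the question:)

import org.apache.solr.client.solrj.SolrQuery;

public class PhraseFilterSketch {
    public static SolrQuery build() {
        SolrQuery query = new SolrQuery("*:*");
        // Phrase filter: requires the tokens "hello" and "world" to appear
        // adjacently and in order in the tokenized content field.
        query.addFilterQuery("content:\"hello world\"");
        return query;
    }
}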
Solr re-indexing in case of store=false
Dear all, Hi, I was wondering, is it possible to re-index Solr 6.0 data when store=false? I am using Solr as a secondary datastore, and for the sake of space efficiency all the fields (except id) are set to store=false. Currently, due to some changes in the application business, the Solr schema has to change, and in order to see the effect of the schema change on old data, I have to re-index. I know that one way of re-indexing in Solr is reading data from one collection (core) and inserting it into another one, but this solution is not possible for store=false fields, and re-indexing all the data through the primary datastore is rather costly, so I would be grateful if somebody could suggest another way of re-indexing the whole data without using another datastore. Sincerely, -- A.Nazemian
Re: Solr re-indexing in case of store=false
Dear Erick, Hi, Thank you very much. About the storage overhead you are right, unless the primary datastore uses some kind of data compression, which in my case it does (I am using Cassandra as the primary datastore); I am not sure whether Solr applies any compression to stored fields. Based on your reply, it seems I have to do it the hard way, i.e. use the primary datastore to rebuild the index from scratch. Sincerely, On Sun, May 8, 2016 at 11:07 PM, Erick Erickson wrote: > bq: I would be grateful if somebody could introduce other way of > re-indexing > the whole data without using another datastore > > Not possible currently. Consider what's _in_ the index when stored="false". > The actual terms are the output of the entire analysis chain, including > stemming, stopword removal, synonym substitution etc. Since the > indexing process is lossy, you simply cannot reconstruct the original > stream from the indexed terms. > > I suppose one _could_ do this in the case of docValues only index with > the new return-values-from-docvalues functionality, but even that's lossy > because the order of returned values may not be the original insertion > order. And if that suits your needs, a pretty simple driver program would > suffice. > > To do this from indexed-only terms you'd have to somehow store the > original version of each term or store some codes indicating exactly > how to reconstruct the original steam, which very possibly would take > up as much space as if you'd just stored the values anyway. _And_ it > would burden every one else who didn't want to do this with a bloated > index. > > Best, > Erick > > On Sun, May 8, 2016 at 4:25 AM, Ali Nazemian > wrote: > > Dear all, > > Hi, > > I was wondering, is it possible to re-index Solr 6.0 data in case of > > store=false? I am using Solr as a secondary datastore, and for the sake > of > > space efficiency all the fields (except id) are considered as > store=false. > > Currently, due to some changes in application business, Solr schema > should > > change, and in order to see the effect of changing schema on old data, I > > have to do the re-index process. I know that one way of re-indexing in > > Solr is reading data from one collection (core) and inserting that to > > another one, but this solution is not possible for store=false fields, > and > > re-indexing the whole data through primary datastore is kind of costly, > so > > I would be grateful if somebody could introduce other way of re-indexing > > the whole data without using another datastore. > > > > Sincerely, > > > > -- > > A.Nazemian > -- A.Nazemian
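(Building on Erick's remark that for a docValues-only index "a pretty simple driver program would suffice", here is a rough, untested sketch of what such a driver could look like in SolrJ. The collection names, the field list, and the cursor-based paging are assumptions for illustration, not something provided in the thread; the fl fields must be docValues-backed and returnable, e.g. via useDocValuesAsStored in Solr 6:)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class DocValuesReindexDriver {
    public static void main(String[] args) throws Exception {
        HttpSolrClient source = new HttpSolrClient("http://localhost:8983/solr/oldCollection");
        HttpSolrClient target = new HttpSolrClient("http://localhost:8983/solr/newCollection");

        SolrQuery query = new SolrQuery("*:*");
        query.setFields("id", "fieldA", "fieldB");      // placeholder docValues fields
        query.setRows(1000);
        query.setSort("id", SolrQuery.ORDER.asc);       // cursorMark requires a sort on the unique key

        String cursor = CursorMarkParams.CURSOR_MARK_START;
        while (true) {
            query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
            QueryResponse rsp = source.query(query);
            for (SolrDocument doc : rsp.getResults()) {
                SolrInputDocument in = new SolrInputDocument();
                for (String name : doc.getFieldNames()) {
                    // multi-valued fields would need getFieldValues() instead
                    in.addField(name, doc.getFieldValue(name));
                }
                target.add(in);
            }
            String next = rsp.getNextCursorMark();
            if (cursor.equals(next)) break;             // no more pages
            cursor = next;
        }
        target.commit();
        source.close();
        target.close();
    }
}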
java.lang.IllegalStateException: Too many values for UnInvertedField faceting on field content
Dears, Hi, I have a collection of 1.6m documents in Solr 5.2.1. When I facet on the content field, this error appears after around 30 seconds of trying to return the results:

null:org.apache.solr.common.SolrException: Exception during facet.field: content
at org.apache.solr.request.SimpleFacets$3.call(SimpleFacets.java:632)
at org.apache.solr.request.SimpleFacets$3.call(SimpleFacets.java:617)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at org.apache.solr.request.SimpleFacets$2.execute(SimpleFacets.java:571)
at org.apache.solr.request.SimpleFacets.getFacetFieldCounts(SimpleFacets.java:642)
at org.apache.solr.request.SimpleFacets.getFacetCounts(SimpleFacets.java:285)
at org.apache.solr.handler.component.FacetComponent.process(FacetComponent.java:102)
at org.apache.solr.handler.component.SearchHandler.handleRequestBody(SearchHandler.java:255)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.server.Server.handle(Server.java:497)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalStateException: Too many values for UnInvertedField faceting on field content
at org.apache.lucene.uninverting.DocTermOrds.uninvert(DocTermOrds.java:509)
at org.apache.lucene.uninverting.DocTermOrds.(DocTermOrds.java:215)
at org.apache.lucene.uninverting.DocTermOrds.(DocTermOrds.java:206)
at org.apache.lucene.uninverting.DocTermOrds.(DocTermOrds.java:199)
at org.apache.lucene.uninverting.FieldCacheImpl$DocTermOrdsCache.createValue(FieldCacheImpl.java:946)
at org.apache.lucene.uninverting.FieldCacheImpl$Cache.get(FieldCacheImpl.java:190)
at org.apache.lucene.uninverting.FieldCacheImpl.getDocTermOrds(FieldCacheImpl.java:933)
at org.apache.lucene.uninverting.UninvertingReader.getSortedSetDocValues(UninvertingReader.java:275)
at org.apache.lucene.index.FilterLeafReader.getSortedSetDocValues(FilterLeafReader.java:454)
at org.apache.lucene.index.MultiDocValues.getSortedSetValues(MultiDocValues.java:356)
at org.apache.lucene.index.SlowCompositeReaderWrapper.getSortedSetDocValues(SlowCompositeReaderWrapper.java:165)
at org.apache.solr.request.DocValuesFacets.getCounts(DocValuesFacets.java:72)
at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:490)
at org.apache.solr.request.SimpleFacets.getTermCounts(SimpleFacets.java:386)
at org.apache.solr.request.SimpleFacets$3.call(SimpleFacets.java:626)
... 33 more

Here is the schema.xml related to the content field: Would you please help me to solve this problem? Best regards. -- A.Nazemian
Re: java.lang.IllegalStateException: Too many values for UnInvertedField faceting on field content
Dear Toke and Davidphilip, Hi, The fieldtype text_fa has some custom language-specific normalizer and charfilter; here is the schema.xml definition related to this field: I did try facet.method=enum and it works fine. Did you mean that applying a facet on an analyzed field is actually wrong? Best regards. On Mon, Jul 20, 2015 at 8:07 PM, Toke Eskildsen wrote: > Ali Nazemian wrote: > > I have a collection of 1.6m documents in Solr 5.2.1. > > [...] > > Caused by: java.lang.IllegalStateException: Too many values for > > UnInvertedField faceting on field content > > [...] > > > default="noval" termVectors="true" termPositions="true" > > termOffsets="true"/> > > You are hitting an internal limit in Solr. As davidphilip tells you, the > solution is docValues, but they cannot be enabled for text fields. You need > String fields, but the name of your field suggests that you need > analyzation & tokenization, which cannot be done on String fields. > > > Would you please help me to solve this problem? > > With the information we have, it does not seem to be easy to solve: It > seems like you want to facet on all terms in your index. As they need to be > String (to use docValues), you would have to do all the splitting on white > space, normalization etc. outside of Solr. > > - Toke Eskildsen > -- A.Nazemian
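(For reference, a minimal SolrJ sketch of the facet request with facet.method=enum that is reported to work above; the Solr URL and collection name are assumptions:)

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class EnumFacetSketch {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/collection1");
        SolrQuery query = new SolrQuery("*:*");
        query.setRows(0);
        query.setFacet(true);
        query.addFacetField("content");
        // enum walks the term dictionary and uses the filterCache per term,
        // instead of un-inverting the field as the default fc/fcs methods do.
        query.set("facet.method", "enum");
        QueryResponse rsp = client.query(query);
        System.out.println(rsp.getFacetField("content").getValueCount());
        client.close();
    }
}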
Re: java.lang.IllegalStateException: Too many values for UnInvertedField faceting on field content
Dear Erick, Actually, faceting on this field is not a user-facing feature. I did it to test the customized normalizer and charfilter that I use, so it is only for testing purposes. Anyway, I did some googling on this error and it seems that changing the facet method to enum works in other similar cases too. I don't know how the fcs and enum methods differ in calculating facets behind the scenes, but it seems that enum works better in my case. Best regards. On Tue, Jul 21, 2015 at 9:08 AM, Erick Erickson wrote: > This really seems like an XY problem. _Why_ are you faceting on a > tokenized field? > What are you really trying to accomplish? Because faceting on a generalized > content field that's an analyzed field is often A Bad Thing. Try going > into the > admin UI>> Schema Browser for that field, and you'll see how many unique > terms > you have in that field. Faceting on that many unique terms is rarely > useful to the > end user, so my suspicion is that you're not doing what you think you > are. Or you > have an unusual use-case. Either way, we need to understand what use-case > you're trying to support in order to respond helpfully. > > You say that using facet.enum works, this is very surprising. That method > uses > the filterCache to create a bitset for each unique term. Which is totally > incompatible with the uninverted field error you're reporting, so I > clearly don't > understand something about your setup. Are you _sure_? > > Best, > Erick > > On Mon, Jul 20, 2015 at 9:32 PM, Ali Nazemian > wrote: > > Dear Toke and Davidphilip, > > Hi, > > The fieldtype text_fa has some custom language specific normalizer and > > charfilter, here is the schema.xml value related for this field: > > positionIncrementGap="100"> > > > > > class="com.ictcert.lucene.analysis.fa.FarsiCharFilterFactory"/> > > > > > > > class="com.ictcert.lucene.analysis.fa.FarsiNormalizationFilterFactory"/> > > > words="lang/stopwords_fa.txt" /> > > > > > > > class="com.ictcert.lucene.analysis.fa.FarsiCharFilterFactory"/> > > > > > > > class="com.ictcert.lucene.analysis.fa.FarsiNormalizationFilterFactory"/> > > > words="lang/stopwords_fa.txt" /> > > > > > > > > I did try the facet.method=enum and it works fine. Did you mean that > > actually applying facet on analyzed field is wrong? > > > > Best regards. > > > > On Mon, Jul 20, 2015 at 8:07 PM, Toke Eskildsen > > wrote: > > > >> Ali Nazemian wrote: > >> > I have a collection of 1.6m documents in Solr 5.2.1. > >> > [...] > >> > Caused by: java.lang.IllegalStateException: Too many values for > >> > UnInvertedField faceting on field content > >> > [...] > >> > >> > default="noval" termVectors="true" termPositions="true" > >> > termOffsets="true"/> > >> > >> You are hitting an internal limit in Solr. As davidphilip tells you, the > >> solution is docValues, but they cannot be enabled for text fields. You > need > >> String fields, but the name of your field suggests that you need > >> analyzation & tokenization, which cannot be done on String fields. > >> > >> > Would you please help me to solve this problem? > >> > >> With the information we have, it does not seem to be easy to solve: It > >> seems like you want to facet on all terms in your index. As they need > to be > >> String (to use docValues), you would have to do all the splitting on > white > >> space, normalization etc. outside of Solr. > >> > >> - Toke Eskildsen > >> > > > > > > > > -- > > A.Nazemian > -- A.Nazemian
Re: java.lang.IllegalStateException: Too many values for UnInvertedField faceting on field content
Dear Erick, I found another thing: I checked the number of unique terms for this field using the schema browser, and it reported 1683404 terms! Does that exceed the maximum number of unique terms for the "fcs" facet method? I read somewhere it should be more than 16m; is that true?! Best regards. On Tue, Jul 21, 2015 at 10:00 AM, Ali Nazemian wrote: > Dear Erick, > > Actually faceting on this field is not a user wanted application. I did > that for the purpose of testing the customized normalizer and charfilter > which I used. Therefore it just used for the purpose of testing. Anyway I > did some googling on this error and It seems that changing facet method to > enum works in other similar cases too. I dont know the differences between > fcs and enum methods on calculating facet behind the scene, but it seems > that enum works better in my case. > > Best regards. > > On Tue, Jul 21, 2015 at 9:08 AM, Erick Erickson > wrote: > >> This really seems like an XY problem. _Why_ are you faceting on a >> tokenized field? >> What are you really trying to accomplish? Because faceting on a >> generalized >> content field that's an analyzed field is often A Bad Thing. Try going >> into the >> admin UI>> Schema Browser for that field, and you'll see how many unique >> terms >> you have in that field. Faceting on that many unique terms is rarely >> useful to the >> end user, so my suspicion is that you're not doing what you think you >> are. Or you >> have an unusual use-case. Either way, we need to understand what use-case >> you're trying to support in order to respond helpfully. >> >> You say that using facet.enum works, this is very surprising. That method >> uses >> the filterCache to create a bitset for each unique term. Which is totally >> incompatible with the uninverted field error you're reporting, so I >> clearly don't >> understand something about your setup. Are you _sure_? >> >> Best, >> Erick >> >> On Mon, Jul 20, 2015 at 9:32 PM, Ali Nazemian >> wrote: >> > Dear Toke and Davidphilip, >> > Hi, >> > The fieldtype text_fa has some custom language specific normalizer and >> > charfilter, here is the schema.xml value related for this field: >> > > positionIncrementGap="100"> >> > >> > > > class="com.ictcert.lucene.analysis.fa.FarsiCharFilterFactory"/> >> > >> > >> > > > class="com.ictcert.lucene.analysis.fa.FarsiNormalizationFilterFactory"/> >> > > > words="lang/stopwords_fa.txt" /> >> > >> > >> > > > class="com.ictcert.lucene.analysis.fa.FarsiCharFilterFactory"/> >> > >> > >> > > > class="com.ictcert.lucene.analysis.fa.FarsiNormalizationFilterFactory"/> >> > > > words="lang/stopwords_fa.txt" /> >> > >> > >> > >> > I did try the facet.method=enum and it works fine. Did you mean that >> > actually applying facet on analyzed field is wrong? >> > >> > Best regards. >> > >> > On Mon, Jul 20, 2015 at 8:07 PM, Toke Eskildsen > > >> > wrote: >> > >> >> Ali Nazemian wrote: >> >> > I have a collection of 1.6m documents in Solr 5.2.1. >> >> > [...] >> >> > Caused by: java.lang.IllegalStateException: Too many values for >> >> > UnInvertedField faceting on field content >> >> > [...] >> >> > > >> > default="noval" termVectors="true" termPositions="true" >> >> > termOffsets="true"/> >> >> >> >> You are hitting an internal limit in Solr. As davidphilip tells you, >> the >> >> solution is docValues, but they cannot be enabled for text fields. You >> need >> >> String fields, but the name of your field suggests that you need >> >> analyzation & tokenization, which cannot be done on String fields. 
>> >> >> >> > Would you please help me to solve this problem? >> >> >> >> With the information we have, it does not seem to be easy to solve: It >> >> seems like you want to facet on all terms in your index. As they need >> to be >> >> String (to use docValues), you would have to do all the splitting on >> white >> >> space, normalization etc. outside of Solr. >> >> >> >> - Toke Eskildsen >> >> >> > >> > >> > >> > -- >> > A.Nazemian >> > > > > -- > A.Nazemian > -- A.Nazemian
Optimizing Solr indexing over WAN
Dears, Hi, I know that there are lots of tips about how to make Solr indexing faster. Probably the most important client-side ones are batch indexing and multi-threaded indexing. There are other important server-side factors which I don't want to mention here. Anyway, my question is: is there any best practice for the number of client threads and the batch size over a WAN network? Since the client and servers are connected over a WAN, performance factors such as network latency and bandwidth are different from a LAN. Another thing that matters to me is the fact that document sizes may differ widely across scenarios. For example, when indexing web pages, document size might range from 1KB to 200KB. In such a case, choosing the batch size by number of documents is probably not the best way to optimize indexing performance; choosing by total batch size in KB/MB would probably be better from the network point of view. However, from the Solr side, the number of documents matters. So, to summarize my questions:

1- Is there any best practice for Solr client-side performance tuning over a WAN for indexing/re-indexing/updating? Is it different from a LAN?
2- Which matters more: the number of documents or the total size of the documents in a batch?

Best regards. -- A.Nazemian
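(No reply is included in this archive; as a hedged illustration of the client-side knobs the question is about, a ConcurrentUpdateSolrClient sketch is shown below. The URL, queue size, thread count, and batch size are placeholder values to be tuned, not recommendations from the thread:)

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class WanIndexingSketch {
    public static void main(String[] args) throws Exception {
        // queueSize controls how many batched requests are buffered locally,
        // threadCount how many parallel connections push them to the server.
        ConcurrentUpdateSolrClient client =
                new ConcurrentUpdateSolrClient("http://remote-host:8983/solr/collection1", 10, 4);

        List<SolrInputDocument> batch = new ArrayList<>();
        for (int i = 0; i < 10000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("content", "example document body " + i);
            batch.add(doc);
            if (batch.size() == 500) {      // placeholder batch size; tune per document size and latency
                client.add(batch);
                batch.clear();
            }
        }
        if (!batch.isEmpty()) {
            client.add(batch);
        }
        client.blockUntilFinished();         // wait for the background queue to drain
        client.commit();
        client.close();
    }
}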
Re: java.lang.IllegalStateException: Too many values for UnInvertedField faceting on field content
Dear Yonik, Hi, Thanks a lot for your response. Best regards. On Tue, Jul 21, 2015 at 5:42 PM, Yonik Seeley wrote: > On Tue, Jul 21, 2015 at 3:09 AM, Ali Nazemian > wrote: > > Dear Erick, > > I found another thing, I did check the number of unique terms for this > > field using schema browser, It reported 1683404 number of terms! Does it > > exceed the maximum number of unique terms for "fcs" facet method? > > The real limit is not simple since the data is not stored in a simple > way (it's compressed). > > > I read > > somewhere it should be more than 16m does it true?! > > More like 16MB of delta-coded terms per block of documents (the index > is split up into 256 blocks for this purpose) > > See DocTermOrds.java if you want more details than that. > > -Yonik > -- A.Nazemian
Re: java.lang.IllegalStateException: Too many values for UnInvertedField faceting on field content
Dear Alessandro, Thank you very much. Yeah sure it is far better, I did not think of that ;) Best regards. On Wed, Jul 22, 2015 at 2:31 PM, Alessandro Benedetti < benedetti.ale...@gmail.com> wrote: > In addition to Erick answer : > I agree 100% on your observations, but I would add that actually, DocValues > should be provided for all not tokenized fields instead of for all not > analysed fields. > > In the end there will be not practical difference if you build the > docValues structures for fields that have a keywordTokenizer ( and for > example a lowercaseTokenFilter following) . > Some charFilters before and simple token filter after can actually be > useful when sorting or faceting ( let's simplify those 2 as main uses for > DocValues) . > > Of course relaxing the use of DocValues from primitive types to analysed > types can be problematic, but there are scenarios where can be a good fit. > I should study a little bit more in deep, what are the current constraints > that are blocking docValues to be applied to analysed fields. > > Cheers > > > Cheers > > 2015-07-21 5:38 GMT+01:00 Erick Erickson : > > > This really seems like an XY problem. _Why_ are you faceting on a > > tokenized field? > > What are you really trying to accomplish? Because faceting on a > generalized > > content field that's an analyzed field is often A Bad Thing. Try going > > into the > > admin UI>> Schema Browser for that field, and you'll see how many unique > > terms > > you have in that field. Faceting on that many unique terms is rarely > > useful to the > > end user, so my suspicion is that you're not doing what you think you > > are. Or you > > have an unusual use-case. Either way, we need to understand what use-case > > you're trying to support in order to respond helpfully. > > > > You say that using facet.enum works, this is very surprising. That method > > uses > > the filterCache to create a bitset for each unique term. Which is totally > > incompatible with the uninverted field error you're reporting, so I > > clearly don't > > understand something about your setup. Are you _sure_? > > > > Best, > > Erick > > > > On Mon, Jul 20, 2015 at 9:32 PM, Ali Nazemian > > wrote: > > > Dear Toke and Davidphilip, > > > Hi, > > > The fieldtype text_fa has some custom language specific normalizer and > > > charfilter, here is the schema.xml value related for this field: > > > > positionIncrementGap="100"> > > > > > > > > class="com.ictcert.lucene.analysis.fa.FarsiCharFilterFactory"/> > > > > > > > > > > > > class="com.ictcert.lucene.analysis.fa.FarsiNormalizationFilterFactory"/> > > > > > words="lang/stopwords_fa.txt" /> > > > > > > > > > > > class="com.ictcert.lucene.analysis.fa.FarsiCharFilterFactory"/> > > > > > > > > > > > > class="com.ictcert.lucene.analysis.fa.FarsiNormalizationFilterFactory"/> > > > > > words="lang/stopwords_fa.txt" /> > > > > > > > > > > > > I did try the facet.method=enum and it works fine. Did you mean that > > > actually applying facet on analyzed field is wrong? > > > > > > Best regards. > > > > > > On Mon, Jul 20, 2015 at 8:07 PM, Toke Eskildsen < > t...@statsbiblioteket.dk> > > > wrote: > > > > > >> Ali Nazemian wrote: > > >> > I have a collection of 1.6m documents in Solr 5.2.1. > > >> > [...] > > >> > Caused by: java.lang.IllegalStateException: Too many values for > > >> > UnInvertedField faceting on field content > > >> > [...] 
> > >> > > >> > default="noval" termVectors="true" termPositions="true" > > >> > termOffsets="true"/> > > >> > > >> You are hitting an internal limit in Solr. As davidphilip tells you, > the > > >> solution is docValues, but they cannot be enabled for text fields. You > > need > > >> String fields, but the name of your field suggests that you need > > >> analyzation & tokenization, which cannot be done on String fields. > > >> > > >> > Would you please help me to solve this problem? > > >> > > >> With the information we have, it does not seem to be easy to solve: It > > >> seems like you want to facet on all terms in your index. As they need > > to be > > >> String (to use docValues), you would have to do all the splitting on > > white > > >> space, normalization etc. outside of Solr. > > >> > > >> - Toke Eskildsen > > >> > > > > > > > > > > > > -- > > > A.Nazemian > > > > > > -- > -- > > Benedetti Alessandro > Visiting card - http://about.me/alessandro_benedetti > Blog - http://alexbenedetti.blogspot.co.uk > > "Tyger, tyger burning bright > In the forests of the night, > What immortal hand or eye > Could frame thy fearful symmetry?" > > William Blake - Songs of Experience -1794 England > -- A.Nazemian
Solr MLT Interestingterms return different terms than Lucene MoreLikeThis for some of the documents
Hi, I am going to implement a SearchComponent for Solr to return a document's main keywords using the MoreLikeThis interesting terms. The main part of the implemented component, which calls mlt.retrieveInterestingTerms with a Lucene docID, does not work for all of the documents. I mean, for some documents the Solr MLT interestingterms handler returns useful terms as the top tf-idf terms while the implemented method returns null! For other documents both results (the Solr MLT interesting terms and mlt.retrieveInterestingTerms(docId)) are the same! Would you please help me solve this issue?

public List<String> getKeywords(int docId) throws SyntaxError {
    String[] fields = new String[keywordSourceFields.size()];
    List<String> terms = new ArrayList<>();
    fields = keywordSourceFields.toArray(fields);
    mlt.setFieldNames(fields);
    mlt.setAnalyzer(indexSearcher.getSchema().getIndexAnalyzer());
    mlt.setMinTermFreq(minTermFreq);
    mlt.setMinDocFreq(minDocFreq);
    mlt.setMinWordLen(minWordLen);
    mlt.setMaxQueryTerms(maxNumKeywords);
    mlt.setMaxNumTokensParsed(maxTokensParsed);
    try {
        terms = Arrays.asList(mlt.retrieveInterestingTerms(docId));
    } catch (IOException e) {
        LOGGER.error(e.getMessage());
        throw new RuntimeException(e);
    }
    return terms;
}

*Note:* I did define termVectors=true for all the required fields that I am going to use for generating interesting terms (the fields array in the method above). Best regards. -- A.Nazemian
Solr cross core join special condition
I was wondering how I can meet this query requirement in Solr 5.2.1: I have two different Solr cores, referred to as "core1" and "core2". core1 has some fields such as field1, field2 and field3, and core2 has some other fields such as field1, field4 and field5. I am looking for a Solr query which can return documents containing field1, field2, field3, field4 and field5, with some condition applied on core1.

For example:
core1:
-field1:123
-field2:"foo"
-field3:"bar"

core2:
-field1:123
-field4:"hello"
-field5:"world"

returning result:
field1:123
field2:"foo"
field3:"bar"
field4:"hello"
field5:"world"

Thank you very much. Best regards. -- A.Nazemian
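(For context, and not part of the original message: Solr's standard cross-core join query parser can filter core1 documents by a condition evaluated on core2, but it only returns fields of the core being queried, which is why the rest of the thread turns to enriching the result with the sibling core's fields. A minimal sketch of that filtering-only join, reusing the field names from the example above:)

import org.apache.solr.client.solrj.SolrQuery;

public class CrossCoreJoinSketch {
    public static SolrQuery build() {
        // Issued against core1: returns core1 documents whose field1 matches a
        // core2 document with field4:hello; field2/field3 come back, but
        // field4/field5 from core2 do not, hence the enrichment question.
        SolrQuery query = new SolrQuery("{!join from=field1 to=field1 fromIndex=core2}field4:hello");
        query.setFields("field1", "field2", "field3");
        return query;
    }
}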
Re: Solr cross core join special condition
Dear Mikhail, Hi, I want to enrich the result. Regards On Oct 6, 2015 7:07 PM, "Mikhail Khludnev" wrote: > Hello, > > Why do you need sibling core fields? do you facet? or just want to enrich > result page with them? > > On Tue, Oct 6, 2015 at 6:04 PM, Ali Nazemian > wrote: > > > I was wondering how can I overcome this query requirement in Solr 5.2.1: > > > > I have two different Solr cores refer as "core1" and "core2". core1 has > > some fields such as field1, field2 and field3 and core2 has some other > > fields such as field1, field4 and field5. I am looking for Solr query > which > > can return all of the documents requiring field1, field2, field3, field4 > > and field5 with considering some condition on core1. > > > > For example: > > core1: > > -field1:123 > > -field2:"foo" > > -field3:"bar" > > > > core2: > > -field1:123 > > -field4:"hello" > > -field5:"world" > > > > returning result: > > field1:123 > > field2:"foo" > > field3:"bar" > > field4:"hello" > > field4:"world" > > > > Thank you very much. > > > > Best regards. > > > > -- > > A.Nazemian > > > > > > -- > Sincerely yours > Mikhail Khludnev > Principal Engineer, > Grid Dynamics > > <http://www.griddynamics.com> > >
Re: Solr cross core join special condition
Yeah, but the child document transformer is used for nested documents inside a single core, while I am looking to join results across multiple cores. So it seems there is no way to do that right now and it would have to be developed somehow. Am I right? Regards. On Oct 6, 2015 9:53 PM, "Mikhail Khludnev" wrote: > thus, something like [child] > > https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents > can be developed. > > On Tue, Oct 6, 2015 at 6:45 PM, Ali Nazemian > wrote: > > > Dear Mikhail, > > Hi, > > I want to enrich the result. > > Regards > > On Oct 6, 2015 7:07 PM, "Mikhail Khludnev" > > wrote: > > > > > Hello, > > > > > > Why do you need sibling core fields? do you facet? or just want to > enrich > > > result page with them? > > > > > > On Tue, Oct 6, 2015 at 6:04 PM, Ali Nazemian > > > wrote: > > > > > > > I was wondering how can I overcome this query requirement in Solr > > 5.2.1: > > > > > > > > I have two different Solr cores refer as "core1" and "core2". core1 > > has > > > > some fields such as field1, field2 and field3 and core2 has some > other > > > > fields such as field1, field4 and field5. I am looking for Solr query > > > which > > > > can return all of the documents requiring field1, field2, field3, > > field4 > > > > and field5 with considering some condition on core1. > > > > > > > > For example: > > > > core1: > > > > -field1:123 > > > > -field2:"foo" > > > > -field3:"bar" > > > > > > > > core2: > > > > -field1:123 > > > > -field4:"hello" > > > > -field5:"world" > > > > > > > > returning result: > > > > field1:123 > > > > field2:"foo" > > > > field3:"bar" > > > > field4:"hello" > > > > field4:"world" > > > > > > > > Thank you very much. > > > > > > > > Best regards. > > > > > > > > -- > > > > A.Nazemian > > > > > > > > > > > > > > > > -- > > > Sincerely yours > > > Mikhail Khludnev > > > Principal Engineer, > > > Grid Dynamics > > > > > > <http://www.griddynamics.com> > > > > > > > > > > > > -- > Sincerely yours > Mikhail Khludnev > Principal Engineer, > Grid Dynamics > > <http://www.griddynamics.com> > >
Re: Solr cross core join special condition
Dear Susheel, Hi, I did check the jira issue that you mentioned but it seems its target is Solr 6! Am I correct? The patch failed for Solr 5.3 due to class not found. For Solr 5.x should I try to implement something similar myself? Sincerely yours. On Wed, Oct 7, 2015 at 7:15 PM, Susheel Kumar wrote: > You may want to take a look at new Solr feature of Streaming API & > Expressions > https://issues.apache.org/jira/browse/SOLR-7584?filter=12333278 > for making joins between collections. > > On Wed, Oct 7, 2015 at 9:42 AM, Ryan Josal wrote: > > > I developed a join transformer plugin that did that (although it didn't > > flatten the results like that). The one thing that was painful about it > is > > that the TextResponseWriter has references to both the IndexSchema and > > SolrReturnFields objects for the primary core. So when you add a > > SolrDocument from another core it returned the wrong fields. I worked > > around that by transforming the SolrDocument to a NamedList. Then when > it > > gets to processing the IndexableFields it uses the wrong IndexSchema, I > > worked around that by transforming each field to a hard Java object > > (through the IndexSchema and FieldType of the correct core). I think it > > would be great to patch TextResponseWriter with multi core writing > > abilities, but there is one question, how can it tell which core a > > SolrDocument or IndexableField is from? Seems we'd have to add an > > attribute for that. > > > > The other possibly simpler thing to do is execute the join at index time > > with an update processor. > > > > Ryan > > > > On Tuesday, October 6, 2015, Mikhail Khludnev < > mkhlud...@griddynamics.com> > > wrote: > > > > > On Wed, Oct 7, 2015 at 7:05 AM, Ali Nazemian > > > wrote: > > > > > > > it > > > > seems there is not any way to do that right now and it should be > > > developed > > > > somehow. Am I right? > > > > > > > > > > yep > > > > > > > > > -- > > > Sincerely yours > > > Mikhail Khludnev > > > Principal Engineer, > > > Grid Dynamics > > > > > > <http://www.griddynamics.com> > > > > > > > > > > -- A.Nazemian
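(As a rough illustration of the Streaming Expressions feature Susheel points to above (SOLR-7584), a join across the two example cores might be expressed roughly as follows. This is a hedged sketch using Solr 6 streaming-expression syntax against the /stream handler; it is not taken from the thread, and it assumes both sides can be sorted and exported on the join key field1:)

innerJoin(
  search(core1, q="*:*", fl="field1,field2,field3", sort="field1 asc"),
  search(core2, q="*:*", fl="field1,field4,field5", sort="field1 asc"),
  on="field1"
)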
Re: Solr cross core join special condition
Thank you very much. Sincerely yours. On Mon, Oct 12, 2015 at 6:15 AM, Susheel Kumar wrote: > Yes, Ali. These are targeted for Solr 6 but you have the option download > source from trunk, build it and try out these features if that helps in the > meantime. > > Thanks > Susheel > > On Sun, Oct 11, 2015 at 10:01 AM, Ali Nazemian > wrote: > > > Dear Susheel, > > Hi, > > > > I did check the jira issue that you mentioned but it seems its target is > > Solr 6! Am I correct? The patch failed for Solr 5.3 due to class not > found. > > For Solr 5.x should I try to implement something similar myself? > > > > Sincerely yours. > > > > > > On Wed, Oct 7, 2015 at 7:15 PM, Susheel Kumar > > wrote: > > > > > You may want to take a look at new Solr feature of Streaming API & > > > Expressions > > > https://issues.apache.org/jira/browse/SOLR-7584?filter=12333278 > > > for making joins between collections. > > > > > > On Wed, Oct 7, 2015 at 9:42 AM, Ryan Josal wrote: > > > > > > > I developed a join transformer plugin that did that (although it > didn't > > > > flatten the results like that). The one thing that was painful about > > it > > > is > > > > that the TextResponseWriter has references to both the IndexSchema > and > > > > SolrReturnFields objects for the primary core. So when you add a > > > > SolrDocument from another core it returned the wrong fields. I > worked > > > > around that by transforming the SolrDocument to a NamedList. Then > when > > > it > > > > gets to processing the IndexableFields it uses the wrong > IndexSchema, I > > > > worked around that by transforming each field to a hard Java object > > > > (through the IndexSchema and FieldType of the correct core). I think > > it > > > > would be great to patch TextResponseWriter with multi core writing > > > > abilities, but there is one question, how can it tell which core a > > > > SolrDocument or IndexableField is from? Seems we'd have to add an > > > > attribute for that. > > > > > > > > The other possibly simpler thing to do is execute the join at index > > time > > > > with an update processor. > > > > > > > > Ryan > > > > > > > > On Tuesday, October 6, 2015, Mikhail Khludnev < > > > mkhlud...@griddynamics.com> > > > > wrote: > > > > > > > > > On Wed, Oct 7, 2015 at 7:05 AM, Ali Nazemian < > alinazem...@gmail.com > > > > > > wrote: > > > > > > > > > > > it > > > > > > seems there is not any way to do that right now and it should be > > > > > developed > > > > > > somehow. Am I right? > > > > > > > > > > > > > > > > yep > > > > > > > > > > > > > > > -- > > > > > Sincerely yours > > > > > Mikhail Khludnev > > > > > Principal Engineer, > > > > > Grid Dynamics > > > > > > > > > > <http://www.griddynamics.com> > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > A.Nazemian > > > -- A.Nazemian
Re: Solr cross core join special condition
Dear Shawn, Hi, Since Yonik's Solr blog <http://yonik.com/solr-5-4/> mentions this feature as one of the Solr 5.4 features, I assume it will be back-ported to the next stable release (5.4). Please correct me if that is the wrong assumption. Thank you very much. Sincerely yours. On Mon, Oct 12, 2015 at 12:29 PM, Ali Nazemian wrote: > Thank you very much. > > Sincerely yours. > > On Mon, Oct 12, 2015 at 6:15 AM, Susheel Kumar > wrote: > >> Yes, Ali. These are targeted for Solr 6 but you have the option download >> source from trunk, build it and try out these features if that helps in >> the >> meantime. >> >> Thanks >> Susheel >> >> On Sun, Oct 11, 2015 at 10:01 AM, Ali Nazemian >> wrote: >> >> > Dear Susheel, >> > Hi, >> > >> > I did check the jira issue that you mentioned but it seems its target is >> > Solr 6! Am I correct? The patch failed for Solr 5.3 due to class not >> found. >> > For Solr 5.x should I try to implement something similar myself? >> > >> > Sincerely yours. >> > >> > >> > On Wed, Oct 7, 2015 at 7:15 PM, Susheel Kumar >> > wrote: >> > >> > > You may want to take a look at new Solr feature of Streaming API & >> > > Expressions >> > > https://issues.apache.org/jira/browse/SOLR-7584?filter=12333278 >> > > for making joins between collections. >> > > >> > > On Wed, Oct 7, 2015 at 9:42 AM, Ryan Josal wrote: >> > > >> > > > I developed a join transformer plugin that did that (although it >> didn't >> > > > flatten the results like that). The one thing that was painful >> about >> > it >> > > is >> > > > that the TextResponseWriter has references to both the IndexSchema >> and >> > > > SolrReturnFields objects for the primary core. So when you add a >> > > > SolrDocument from another core it returned the wrong fields. I >> worked >> > > > around that by transforming the SolrDocument to a NamedList. Then >> when >> > > it >> > > > gets to processing the IndexableFields it uses the wrong >> IndexSchema, I >> > > > worked around that by transforming each field to a hard Java object >> > > > (through the IndexSchema and FieldType of the correct core). I >> think >> > it >> > > > would be great to patch TextResponseWriter with multi core writing >> > > > abilities, but there is one question, how can it tell which core a >> > > > SolrDocument or IndexableField is from? Seems we'd have to add an >> > > > attribute for that. >> > > > >> > > > The other possibly simpler thing to do is execute the join at index >> > time >> > > > with an update processor. >> > > > >> > > > Ryan >> > > > >> > > > On Tuesday, October 6, 2015, Mikhail Khludnev < >> > > mkhlud...@griddynamics.com> >> > > > wrote: >> > > > >> > > > > On Wed, Oct 7, 2015 at 7:05 AM, Ali Nazemian < >> alinazem...@gmail.com >> > > > > > wrote: >> > > > > >> > > > > > it >> > > > > > seems there is not any way to do that right now and it should be >> > > > > developed >> > > > > > somehow. Am I right? >> > > > > > >> > > > > >> > > > > yep >> > > > > >> > > > > >> > > > > -- >> > > > > Sincerely yours >> > > > > Mikhail Khludnev >> > > > > Principal Engineer, >> > > > > Grid Dynamics >> > > > > >> > > > > <http://www.griddynamics.com> >> > > > > > >> > > > > >> > > > >> > > >> > >> > >> > >> > -- >> > A.Nazemian >> > >> > > > > -- > A.Nazemian > -- A.Nazemian
Re: Soft commit and hard commit
Dear Midas, Hi, AFAIK, Solr memory-maps its index files, so it relies heavily on free OS memory for the page cache. Therefore using 36GB out of 48GB of RAM for the Java heap is not recommended; as a rule of thumb, do not allocate more than 25% of your total memory to the Solr JVM in usual situations. About your main question, setting soft commit and hard commit intervals for Solr is highly dependent on your application. A really nice guide for this purpose is provided by Lucidworks; in order to find the best values for soft commit and hard commit please follow this guide: http://lucidworks.com/blog/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/ Best regards. On Mon, Nov 30, 2015 at 9:48 AM, Midas A wrote: > Machine configuration > > RAM: 48 GB > CPU: 8 core > JVM : 36 GB > > We are updating 70 , 000 docs / hr . what should be our soft commit and > hard commit time to get best results. > > Current configuration : > 6 false autoCommit> > > > 60 > > There are no read on master server. > -- A.Nazemian
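(To make the commit knobs concrete, a typical solrconfig.xml shape for the settings discussed in the Lucidworks guide is sketched below; the values are illustrative assumptions to be tuned for the workload, not recommendations from this reply:)

<autoCommit>
  <!-- flush to disk regularly, but do not open a new searcher on every hard commit -->
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <!-- controls how quickly newly indexed documents become visible to searches -->
  <maxTime>30000</maxTime>
</autoSoftCommit>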
Solr 5.2.1 deadlock on commit
Hi, It has been a while since I started having this problem with Solr 5.2.1 and I could not fix it yet. The only thing that is clear to me is that when I send a bulk update to Solr, the commit thread gets blocked! Here is the thread dump output:

"qtp595445781-8207" prio=10 tid=0x7f0bf68f5800 nid=0x5785 waiting for monitor entry [0x7f081cf04000]
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:608)
- waiting to lock <0x00067ba2e660> (a java.lang.Object)
at org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:95)
at org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalCommit(DistributedUpdateProcessor.java:1635)
at org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1612)
at org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:161)
at org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
at org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64)
at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:270)
at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:177)
at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:98)
at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064)
at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654)
at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227)
at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196)
at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:110)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.server.Server.handle(Server.java:497)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:310)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:257)
at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:635)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:555)
at java.lang.Thread.run(Thread.java:745)
Locked ownable synchronizers:
- None

FYI, there are lots of blocked threads in the thread dump report and Solr becomes really slow in this case. The temporary solution is restarting Solr, but I am really sick of restarting! I would really appreciate it if somebody could help me solve this problem. Best regards. -- A.Nazemian
Re: Solr 5.2.1 deadlock on commit
Dear Emir, Hi, There are some cases that I have soft commit in my application. However, the bulk update part has only hard commit for a bulk of 2500 documents. Here are some information about the whole indexing/updating scenarios: - Indexing part uses soft commit. - In a single update cases soft commit is used. - For bulk update batch hard commit is used (on 2500 documents) - Auto hard commit :120 sec - Auto soft commit: disable Best regards. On Tue, Dec 8, 2015 at 12:35 PM, Emir Arnautovic < emir.arnauto...@sematext.com> wrote: > Hi Ali, > This thread is blocked because cannot obtain update lock - in this > particular case when doing soft commit. I am guessing that there others are > blocked for the same reason. Can you tell us bit more about your setup and > indexing load and procedure? Do you do explicit commits? > > Regards, > Emir > > -- > Monitoring * Alerting * Anomaly Detection * Centralized Log Management > Solr & Elasticsearch Support * http://sematext.com/ > > > > On 08.12.2015 08:16, Ali Nazemian wrote: > >> Hi, >> There is a while since I have had problem with Solr 5.2.1 and I could not >> fix it yet. The only think that is clear to me is when I send bulk update >> to Solr the commit thread will be blocked! Here is the thread dump output: >> >> "qtp595445781-8207" prio=10 tid=0x7f0bf68f5800 nid=0x5785 waiting for >> monitor entry [0x7f081cf04000] >> java.lang.Thread.State: BLOCKED (on object monitor) >> at >> >> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:608) >> - waiting to lock <0x00067ba2e660> (a java.lang.Object) >> at >> >> org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:95) >> at >> >> org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64) >> at >> >> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalCommit(DistributedUpdateProcessor.java:1635) >> at >> >> org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1612) >> at >> >> org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:161) >> at >> >> org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64) >> at >> >> org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64) >> at >> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:270) >> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:177) >> at >> >> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:98) >> at >> >> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) >> at >> >> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143) >> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064) >> at org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654) >> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450) >> at >> >> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227) >> at >> >> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196) >> at >> >> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652) >> at >> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585) >> at >> >> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) >> at >> >> 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577) >> at >> >> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223) >> at >> >> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1127) >> at >> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) >> at >> >> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) >> at >> >> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1061) >> at >> >> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) >> at >> >> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215) >> at >> >> org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollect
Re: Solr 5.2.1 deadlock on commit
The indexing load is as follows: - Around 1000 documents every 5 mins. - The indexing speed is slow because of the complicated analyzer which is applied to each document. It takes around 60 seconds to index 1000 documents with applying this analyzer (It is really slow. However, based on the analyzing part I think it would be acceptable). - The concurrentsolrclient is used in all the indexing/updating cases. Regards. On Tue, Dec 8, 2015 at 6:36 PM, Ali Nazemian wrote: > Dear Emir, > Hi, > There are some cases that I have soft commit in my application. However, > the bulk update part has only hard commit for a bulk of 2500 documents. > Here are some information about the whole indexing/updating scenarios: > - Indexing part uses soft commit. > - In a single update cases soft commit is used. > - For bulk update batch hard commit is used (on 2500 documents) > - Auto hard commit :120 sec > - Auto soft commit: disable > > Best regards. > > > On Tue, Dec 8, 2015 at 12:35 PM, Emir Arnautovic < > emir.arnauto...@sematext.com> wrote: > >> Hi Ali, >> This thread is blocked because cannot obtain update lock - in this >> particular case when doing soft commit. I am guessing that there others are >> blocked for the same reason. Can you tell us bit more about your setup and >> indexing load and procedure? Do you do explicit commits? >> >> Regards, >> Emir >> >> -- >> Monitoring * Alerting * Anomaly Detection * Centralized Log Management >> Solr & Elasticsearch Support * http://sematext.com/ >> >> >> >> On 08.12.2015 08:16, Ali Nazemian wrote: >> >>> Hi, >>> There is a while since I have had problem with Solr 5.2.1 and I could not >>> fix it yet. The only think that is clear to me is when I send bulk update >>> to Solr the commit thread will be blocked! Here is the thread dump >>> output: >>> >>> "qtp595445781-8207" prio=10 tid=0x7f0bf68f5800 nid=0x5785 waiting for >>> monitor entry [0x7f081cf04000] >>> java.lang.Thread.State: BLOCKED (on object monitor) >>> at >>> >>> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:608) >>> - waiting to lock <0x00067ba2e660> (a java.lang.Object) >>> at >>> >>> org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:95) >>> at >>> >>> org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64) >>> at >>> >>> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalCommit(DistributedUpdateProcessor.java:1635) >>> at >>> >>> org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1612) >>> at >>> >>> org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:161) >>> at >>> >>> org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64) >>> at >>> >>> org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64) >>> at >>> >>> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:270) >>> at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:177) >>> at >>> >>> org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:98) >>> at >>> >>> org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74) >>> at >>> >>> org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:143) >>> at org.apache.solr.core.SolrCore.execute(SolrCore.java:2064) >>> at 
org.apache.solr.servlet.HttpSolrCall.execute(HttpSolrCall.java:654) >>> at org.apache.solr.servlet.HttpSolrCall.call(HttpSolrCall.java:450) >>> at >>> >>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:227) >>> at >>> >>> org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:196) >>> at >>> >>> org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1652) >>> at >>> >>> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585) >>> at >>> >>> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) >>> at >>> >>> org.eclipse.jetty.security.SecurityHan
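(An aside that is not from the thread: one middle ground between the explicit per-batch hard commit described above and pure autoCommit is SolrJ's commitWithin, which lets Solr schedule the commit itself within a time window. A hedged sketch; the 120-second window simply mirrors the auto hard commit interval mentioned earlier and is not a recommendation from the list:)

import java.util.Collection;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CommitWithinSketch {
    // Instead of calling commit() after every 2500-document bulk update,
    // ask Solr to commit on its own within the given number of milliseconds.
    public static void sendBulk(SolrClient client, Collection<SolrInputDocument> batch)
            throws Exception {
        client.add(batch, 120000);   // commitWithin of 120s
    }
}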
Re: Solr 5.2.1 deadlock on commit
I did that already. The situation was worse. The autocommit part makes solr unavailable. On Dec 8, 2015 7:13 PM, "Emir Arnautovic" wrote: > Hi Ali, > Can you try without explicit commits and see if threads will still be > blocked. > > Thanks, > Emir > > On 08.12.2015 16:19, Ali Nazemian wrote: > >> The indexing load is as follows: >> - Around 1000 documents every 5 mins. >> - The indexing speed is slow because of the complicated analyzer which is >> applied to each document. It takes around 60 seconds to index 1000 >> documents with applying this analyzer (It is really slow. However, based >> on >> the analyzing part I think it would be acceptable). >> - The concurrentsolrclient is used in all the indexing/updating cases. >> >> Regards. >> >> On Tue, Dec 8, 2015 at 6:36 PM, Ali Nazemian >> wrote: >> >> Dear Emir, >>> Hi, >>> There are some cases that I have soft commit in my application. However, >>> the bulk update part has only hard commit for a bulk of 2500 documents. >>> Here are some information about the whole indexing/updating scenarios: >>> - Indexing part uses soft commit. >>> - In a single update cases soft commit is used. >>> - For bulk update batch hard commit is used (on 2500 documents) >>> - Auto hard commit :120 sec >>> - Auto soft commit: disable >>> >>> Best regards. >>> >>> >>> On Tue, Dec 8, 2015 at 12:35 PM, Emir Arnautovic < >>> emir.arnauto...@sematext.com> wrote: >>> >>> Hi Ali, >>>> This thread is blocked because cannot obtain update lock - in this >>>> particular case when doing soft commit. I am guessing that there others >>>> are >>>> blocked for the same reason. Can you tell us bit more about your setup >>>> and >>>> indexing load and procedure? Do you do explicit commits? >>>> >>>> Regards, >>>> Emir >>>> >>>> -- >>>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management >>>> Solr & Elasticsearch Support * http://sematext.com/ >>>> >>>> >>>> >>>> On 08.12.2015 08:16, Ali Nazemian wrote: >>>> >>>> Hi, >>>>> There is a while since I have had problem with Solr 5.2.1 and I could >>>>> not >>>>> fix it yet. The only think that is clear to me is when I send bulk >>>>> update >>>>> to Solr the commit thread will be blocked! 
Here is the thread dump >>>>> output: >>>>> >>>>> "qtp595445781-8207" prio=10 tid=0x7f0bf68f5800 nid=0x5785 waiting >>>>> for >>>>> monitor entry [0x7f081cf04000] >>>>> java.lang.Thread.State: BLOCKED (on object monitor) >>>>> at >>>>> >>>>> >>>>> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:608) >>>>> - waiting to lock <0x00067ba2e660> (a java.lang.Object) >>>>> at >>>>> >>>>> >>>>> org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:95) >>>>> at >>>>> >>>>> >>>>> org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64) >>>>> at >>>>> >>>>> >>>>> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalCommit(DistributedUpdateProcessor.java:1635) >>>>> at >>>>> >>>>> >>>>> org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1612) >>>>> at >>>>> >>>>> >>>>> org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:161) >>>>> at >>>>> >>>>> >>>>> org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64) >>>>> at >>>>> >>>>> >>>>> org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64) >>>>> at >>>>> >>>>> >>>>> org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:270) >>>>> at org.apache.sol
Re: Solr 5.2.1 deadlock on commit
I really appreciate if somebody can help me to solve this problem. Regards. On Tue, Dec 8, 2015 at 9:22 PM, Ali Nazemian wrote: > I did that already. The situation was worse. The autocommit part makes > solr unavailable. > On Dec 8, 2015 7:13 PM, "Emir Arnautovic" > wrote: > >> Hi Ali, >> Can you try without explicit commits and see if threads will still be >> blocked. >> >> Thanks, >> Emir >> >> On 08.12.2015 16:19, Ali Nazemian wrote: >> >>> The indexing load is as follows: >>> - Around 1000 documents every 5 mins. >>> - The indexing speed is slow because of the complicated analyzer which is >>> applied to each document. It takes around 60 seconds to index 1000 >>> documents with applying this analyzer (It is really slow. However, based >>> on >>> the analyzing part I think it would be acceptable). >>> - The concurrentsolrclient is used in all the indexing/updating cases. >>> >>> Regards. >>> >>> On Tue, Dec 8, 2015 at 6:36 PM, Ali Nazemian >>> wrote: >>> >>> Dear Emir, >>>> Hi, >>>> There are some cases that I have soft commit in my application. However, >>>> the bulk update part has only hard commit for a bulk of 2500 documents. >>>> Here are some information about the whole indexing/updating scenarios: >>>> - Indexing part uses soft commit. >>>> - In a single update cases soft commit is used. >>>> - For bulk update batch hard commit is used (on 2500 documents) >>>> - Auto hard commit :120 sec >>>> - Auto soft commit: disable >>>> >>>> Best regards. >>>> >>>> >>>> On Tue, Dec 8, 2015 at 12:35 PM, Emir Arnautovic < >>>> emir.arnauto...@sematext.com> wrote: >>>> >>>> Hi Ali, >>>>> This thread is blocked because cannot obtain update lock - in this >>>>> particular case when doing soft commit. I am guessing that there >>>>> others are >>>>> blocked for the same reason. Can you tell us bit more about your setup >>>>> and >>>>> indexing load and procedure? Do you do explicit commits? >>>>> >>>>> Regards, >>>>> Emir >>>>> >>>>> -- >>>>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management >>>>> Solr & Elasticsearch Support * http://sematext.com/ >>>>> >>>>> >>>>> >>>>> On 08.12.2015 08:16, Ali Nazemian wrote: >>>>> >>>>> Hi, >>>>>> There is a while since I have had problem with Solr 5.2.1 and I could >>>>>> not >>>>>> fix it yet. The only think that is clear to me is when I send bulk >>>>>> update >>>>>> to Solr the commit thread will be blocked! Here is the thread dump >>>>>> output: >>>>>> >>>>>> "qtp595445781-8207" prio=10 tid=0x7f0bf68f5800 nid=0x5785 waiting >>>>>> for >>>>>> monitor entry [0x7f081cf04000] >>>>>> java.lang.Thread.State: BLOCKED (on object monitor) >>>>>> at >>>>>> >>>>>> >>>>>> org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:608) >>>>>> - waiting to lock <0x00067ba2e660> (a java.lang.Object) >>>>>> at >>>>>> >>>>>> >>>>>> org.apache.solr.update.processor.RunUpdateProcessor.processCommit(RunUpdateProcessorFactory.java:95) >>>>>> at >>>>>> >>>>>> >>>>>> org.apache.solr.update.processor.UpdateRequestProcessor.processCommit(UpdateRequestProcessor.java:64) >>>>>> at >>>>>> >>>>>> >>>>>> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalCommit(DistributedUpdateProcessor.java:1635) >>>>>> at >>>>>> >>>>>> >>>>>> org.apache.solr.update.processor.DistributedUpdateProcessor.processCommit(DistributedUpdateProcessor.java:1612) >>>>>> at >>>>>> >>>>>> >>>>>> org.apache.solr.update.processor.LogUpdateProcessor.processCommit(LogUpdateProcessorFactory.java:161) >>>>>> at >>>>>> >>>>>> >>>>>>
Re: Solr 5.2.1 deadlock on commit
Dear Emir, Hi, Actually Solr is in a deadlock state it will not accept any new document. (some of them will store in tlog and some of them not) However, It will response to the new query requests very slowly. Unfortunately right now I have not any access to full thread dump. But, as I mentioned, it is full of thread in blocked state. P.S: I am using 20 threads for the indexing part. I am suspicious of auto hard commit part. Since the indexing/updating part is really slow for the sake of complicate analyzer, it is possible that updating 2500 documents takes more than 120 seconds so before finishing the first hard commit second hard commit would arrive same for third and forth and so on. Therefore it might possible that lots of commit thread would be active at the same time with lots of documents in memory that are not flushed to disk yet. However, I am not sure that such scenario could take Solr threads to deadlock state! Best regards. On Fri, Dec 11, 2015 at 1:02 PM, Emir Arnautovic < emir.arnauto...@sematext.com> wrote: > Hi Ali, > Is Solr busy at that time and eventually recover or it is deadlocked? Can > you provide full thread dump when it happened? > Do you run only indexing at that time? Is "unavailable" only from indexing > perspective, or you cannot do anything with Solr? > Is there any indexing scenario that does not cause this (extreme/useless > one is without commits)? > Did you try throttling indexing or changing bulk size? > How many indexing threads? > > Thanks, > Emir > > > On 11.12.2015 10:06, Ali Nazemian wrote: > >> I really appreciate if somebody can help me to solve this problem. >> Regards. >> >> On Tue, Dec 8, 2015 at 9:22 PM, Ali Nazemian >> wrote: >> >> I did that already. The situation was worse. The autocommit part makes >>> solr unavailable. >>> On Dec 8, 2015 7:13 PM, "Emir Arnautovic" >>> wrote: >>> >>> Hi Ali, >>>> Can you try without explicit commits and see if threads will still be >>>> blocked. >>>> >>>> Thanks, >>>> Emir >>>> >>>> On 08.12.2015 16:19, Ali Nazemian wrote: >>>> >>>> The indexing load is as follows: >>>>> - Around 1000 documents every 5 mins. >>>>> - The indexing speed is slow because of the complicated analyzer which >>>>> is >>>>> applied to each document. It takes around 60 seconds to index 1000 >>>>> documents with applying this analyzer (It is really slow. However, >>>>> based >>>>> on >>>>> the analyzing part I think it would be acceptable). >>>>> - The concurrentsolrclient is used in all the indexing/updating cases. >>>>> >>>>> Regards. >>>>> >>>>> On Tue, Dec 8, 2015 at 6:36 PM, Ali Nazemian >>>>> wrote: >>>>> >>>>> Dear Emir, >>>>> >>>>>> Hi, >>>>>> There are some cases that I have soft commit in my application. >>>>>> However, >>>>>> the bulk update part has only hard commit for a bulk of 2500 >>>>>> documents. >>>>>> Here are some information about the whole indexing/updating scenarios: >>>>>> - Indexing part uses soft commit. >>>>>> - In a single update cases soft commit is used. >>>>>> - For bulk update batch hard commit is used (on 2500 documents) >>>>>> - Auto hard commit :120 sec >>>>>> - Auto soft commit: disable >>>>>> >>>>>> Best regards. >>>>>> >>>>>> >>>>>> On Tue, Dec 8, 2015 at 12:35 PM, Emir Arnautovic < >>>>>> emir.arnauto...@sematext.com> wrote: >>>>>> >>>>>> Hi Ali, >>>>>> >>>>>>> This thread is blocked because cannot obtain update lock - in this >>>>>>> particular case when doing soft commit. I am guessing that there >>>>>>> others are >>>>>>> blocked for the same reason. 
Can you tell us bit more about your >>>>>>> setup >>>>>>> and >>>>>>> indexing load and procedure? Do you do explicit commits? >>>>>>> >>>>>>> Regards, >>>>>>> Emir >>>>>>> >>>>>>> -- >>>>>>> Monitoring * Alerting * Anomaly Detection * Centralized Log >>>>>>> Mana
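As an aside to this thread, Emir's "no explicit commits" experiment is easy to set up on the client side: send the batches without calling commit() at all and let the server-side autoCommit (plus an autoSoftCommit interval for visibility) do the work. A rough SolrJ 5.x sketch, where the URL, queue size and thread count are arbitrary examples:

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class NoExplicitCommitIndexer {
  public static void main(String[] args) throws Exception {
    // Buffers documents and streams them to Solr with 4 background threads.
    ConcurrentUpdateSolrClient client =
        new ConcurrentUpdateSolrClient("http://localhost:8983/solr/mycore", 100, 4);

    for (int i = 0; i < 1000; i++) {
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-" + i);
      doc.addField("text", "document body " + i);
      client.add(doc);              // no commit() here...
    }

    client.blockUntilFinished();    // ...durability and visibility are left to
    client.close();                 // autoCommit/autoSoftCommit in solrconfig.xml
  }
}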
Solr query performance
Dear all, Hi, I was wondering whether there is any performance comparison available for different Solr queries. I mean, what is the cost of different Solr queries from the memory and CPU points of view? I am looking for a report that could help me choose between different alternatives for sending a single query to Solr. Thank you very much. Best regards. -- A.Nazemian
filtering tfq() function query to specific part of collection not the whole documents
Hi, I was wondering whether it is possible to restrict the tfq() function query to a specific subset of the collection. Suppose I want to count all occurrences of the term "test" in documents matching fq=category:2; how can I handle such a query with the tfq() function query? It seems that applying fq=category:2 in a "select" query does not affect tfq(): no matter what the rest of my query is, tfq() always returns the total term frequency of the field over the whole collection. So what is the solution for this case? Best regards. -- A.Nazemian
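For what it's worth, one way to get a per-subset total, assuming a Solr version with the JSON Facet API, is to sum the per-document termfreq() values over the filtered result set instead of relying on the collection-wide statistic. Something along these lines, where the field and category values are only examples and sum() over termfreq() should be verified on your version:

q=*:*&fq=category:2&rows=0&json.facet={total_test:"sum(termfreq(content,'test'))"}

Because the aggregation runs only over documents matching q and fq, the category:2 restriction is honored, unlike a total-term-frequency function, which is an index-level statistic by design.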
Custom updateProcessor for purpose of extracting interesting terms at index time
Dear All, Hi, I wrote a custom UpdateRequestProcessorFactory for the purpose of extracting interesting terms at index time and putting them in a new field. Since I use MLT interesting terms for this purpose, I have to check whether the added document already exists in the index. If it was indexed before, there is no problem for MLT interesting terms. But if it is a new document, I have to index it before calling MLT interesting terms. Here is the small part of my code that ran me into this problem:

if (!isDocIndexed(cmd.getIndexedId())) {
  // Do not extract keywords since the document is not indexed yet
  super.processAdd(cmd);
  processAdd(cmd);
  return;
}

My problem is that the searcher returned by core.getRealtimeSearcher() does not change after calling super.processAdd(cmd), so this piece of code causes an infinite loop! Would you please guide me on how to make sure that my custom update processor runs at the end of the indexing process (in order to use MLT interesting terms without having to worry about whether the document already exists in the index)? Best regards. -- A.Nazemian
Re: Custom updateProcessor for purpose of extracting interesting terms at index time
Dear Alex, Hi, I am not sure about what would be the best way of doing such process, Would you please provide me some detail example about doing that on commit? like spell checker that you mentioned? Is is possible to do that using a custom analyzer on a copy field? In order to use MLT interesting terms I should have access to SolrIndexSearcher. I am not sure that can I have access to SolrIndexSearcher in analyzer or not? Best regards. On Mon, Mar 23, 2015 at 11:51 PM, Alexandre Rafalovitch wrote: > So, for a new document. You want to index the document, then read it, > then add keywords and index again? This does sound like an infinite > loop. Not sure there is a solution for this approach. > > You sure you cannot do it like spell checker does with compiling a > side-car index on commit? Or even with some sort of periodic trigger > and update command issues on previously indexed but not post-processed > documents. > > Regards, >Alex. > > Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter: > http://www.solr-start.com/ > > > On 23 March 2015 at 15:07, Ali Nazemian wrote: > > Dear All, > > Hi, > > I wrote a customize updateProcessorFactory for the purpose of extracting > > interesting terms at index time an putting them in a new field. Since I > use > > MLT interesting terms for this purpose, I have to make sure that the > added > > document exists in index or not. If it was indexed before there is no > > problem for MLT interesting terms. But if it is a new document I have to > > index this document before calling MLT interesting terms. > > Here is a small part of my code that ran me into problem: > > > > if (!isDocIndexed(cmd.getIndexedId())) { > > // Do not extract keyword since it is not indexed yet > > super.processAdd(cmd); > > processAdd(cmd); > > return; > > } > > > > My problem is the core.getRealtimeSearcher() method does not change after > > calling super.processAdd(cmd). Therefore such part of code causes > infinite > > loop! Would you please guide me how can I make sure that my custom > > updateProcessorFactory run at the end of indexing process. (in order to > > using MLT interesting terms without having concern about the existence of > > document in index. > > > > Best regards. > > -- > > A.Nazemian > -- A.Nazemian
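For reference, the "do it on commit" idea that Alex mentions can be wired up with an event listener registered in solrconfig.xml (a <listener> entry on the newSearcher or postCommit event). The following is only a skeleton under the assumption that Solr 4.x/5.x APIs are in use; the keyword extraction and the re-submission of documents are left as placeholders:

import org.apache.solr.common.util.NamedList;
import org.apache.solr.core.SolrCore;
import org.apache.solr.core.SolrEventListener;
import org.apache.solr.search.SolrIndexSearcher;

// Skeleton of a listener that runs whenever a commit opens a new searcher.
public class KeywordExtractionListener implements SolrEventListener {

  private final SolrCore core;

  // Solr passes the core to listeners that declare a (SolrCore) constructor;
  // this mirrors the built-in listeners, so verify it for your version.
  public KeywordExtractionListener(SolrCore core) {
    this.core = core;
  }

  @Override
  public void init(NamedList args) {
    // read minTermFreq, maxNumKeywords, source field names, etc. from the listener config
  }

  @Override
  public void newSearcher(SolrIndexSearcher newSearcher, SolrIndexSearcher currentSearcher) {
    // 1. query newSearcher for documents that do not have the keyword field yet
    // 2. run MoreLikeThis.retrieveInterestingTerms(docId) against newSearcher
    // 3. re-submit those documents through the normal update chain (e.g. as atomic
    //    updates) so the keywords become searchable after the next commit
  }

  @Override
  public void postCommit() {
    // not used; the work happens in newSearcher() where a searcher is available
  }

  @Override
  public void postSoftCommit() {
    // not used
  }
}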
filtering indexed documents with multiple filters
Dear all, Hi, I am looking for a way to filter a Lucene index with multiple conditions. For this purpose I tried two different methods of filtering the search; neither of them works for me:

Using BooleanQuery:

BooleanQuery query = new BooleanQuery();
String lower = "*";
String upper = "*";
for (String fieldName : keywordSourceFields) {
  TermRangeQuery rangeQuery = TermRangeQuery.newStringRange(fieldName, lower, upper, true, true);
  query.add(rangeQuery, Occur.MUST);
}
TermRangeQuery rangeQuery = TermRangeQuery.newStringRange(keywordField, lower, upper, true, true);
query.add(rangeQuery, Occur.MUST_NOT);
try {
  TopDocs results = searcher.search(query, null, maxNumDocs);

Using BooleanFilter:

BooleanFilter filter = new BooleanFilter();
String lower = "*";
String upper = "*";
for (String fieldName : keywordSourceFields) {
  TermRangeFilter rangeFilter = TermRangeFilter.newStringRange(fieldName, lower, upper, true, true);
  filter.add(rangeFilter, Occur.MUST_NOT);
}
TermRangeFilter rangeFilter = TermRangeFilter.newStringRange(keywordField, lower, upper, true, true);
filter.add(rangeFilter, Occur.MUST);
try {
  TopDocs results = searcher.search(new MatchAllDocsQuery(), filter, maxNumDocs);

I was wondering which part of the chosen queries is wrong. I am looking for documents where each of the keywordSourceFields has some value AND the keyword field has no value. Please guide me through correcting the corresponding query. Best regards. -- A.Nazemian
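A likely culprit in both snippets is that "*" is treated as a literal term by TermRangeQuery/TermRangeFilter rather than as a wildcard, so the range only matches documents that literally contain the term "*". One way to express "every source field has a value AND the keyword field has no value", sketched here against the Lucene 4.x API with null (open-ended) range bounds, could look like this:

import java.io.IOException;
import java.util.List;

import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermRangeQuery;
import org.apache.lucene.search.TopDocs;

public class FieldExistsQueryExample {

  // Find documents where every source field has at least one term
  // and the keyword field has no term at all.
  public static TopDocs findDocsMissingKeywords(IndexSearcher searcher,
                                                List<String> keywordSourceFields,
                                                String keywordField,
                                                int maxNumDocs) throws IOException {
    BooleanQuery query = new BooleanQuery();
    for (String fieldName : keywordSourceFields) {
      // null bounds = open-ended range, i.e. "any term in this field"
      query.add(TermRangeQuery.newStringRange(fieldName, null, null, true, true), Occur.MUST);
    }
    // exclude documents that already have something in the keyword field
    query.add(TermRangeQuery.newStringRange(keywordField, null, null, true, true), Occur.MUST_NOT);
    return searcher.search(query, maxNumDocs);
  }
}

In Lucene 4.x, FieldValueFilter is another way to express "field has a value", though it relies on the field cache, so the open-ended range query above may be the simpler starting point.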
Lucene indexWriter update does not affect Solr search
I implemented some small code for the purpose of extracting keywords out of the Lucene index. I implemented it using a search component. My problem is that when I try to update the index through a Lucene IndexWriter, the Solr index that sits on top of it is not affected. As you can see, I did call commit.

BooleanQuery query = new BooleanQuery();
for (String fieldName : keywordSourceFields) {
  TermQuery termQuery = new TermQuery(new Term(fieldName,"N/A"));
  query.add(termQuery, Occur.MUST_NOT);
}
TermQuery termQuery = new TermQuery(new Term(keywordField, "N/A"));
query.add(termQuery, Occur.MUST);
try {
  //Query q = new QueryParser(keywordField, new StandardAnalyzer()).parse(query.toString());
  TopDocs results = searcher.search(query, maxNumDocs);
  ScoreDoc[] hits = results.scoreDocs;
  IndexWriter writer = getLuceneIndexWriter(searcher.getPath());
  for (int i = 0; i < hits.length; i++) {
    Document document = searcher.doc(hits[i].doc);
    List keywords = keyword.getKeywords(hits[i].doc);
    if (keywords.size() > 0) document.removeFields(keywordField);
    for (String word : keywords) {
      document.add(new StringField(keywordField, word, Field.Store.YES));
    }
    String uniqueKey = searcher.getSchema().getUniqueKeyField().getName();
    writer.updateDocument(new Term(uniqueKey, document.get(uniqueKey)), document);
  }
  writer.commit();
  writer.forceMerge(1);
  writer.close();
} catch (IOException | SyntaxError e) {
  throw new RuntimeException();
}

Please help me through solving this problem. -- A.Nazemian
Re: Lucene indexWriter update does not affect Solr search
Dear Upayavira, Hi, It is just the part of my code in which caused the problem. I know searchComponent is not for changing the index, but for the purpose of extracting document keywords I was forced to hack searchComponent for extracting keywords and putting them into index. For more information about why I chose searchComponent at the first place please follow this link: https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201503.mbox/browser Best regards. On Tue, Apr 7, 2015 at 5:30 PM, Upayavira wrote: > What are you trying to do? A search component is not intended for > updating the index, so it really doesn’t surprise me that you aren’t > seeing updates. > > I’d suggest you describe the problem you are trying to solve before > proposing solutions. > > Upayavira > > > On Tue, Apr 7, 2015, at 01:32 PM, Ali Nazemian wrote: > > I implement a small code for the purpose of extracting some keywords out > > of > > Lucene index. I did implement that using search component. My problem is > > when I tried to update Lucene IndexWriter, Solr index which is placed on > > top of that, does not affect. As you can see I did the commit part. > > > > BooleanQuery query = new BooleanQuery(); > > for (String fieldName : keywordSourceFields) { > > TermQuery termQuery = new TermQuery(new Term(fieldName,"N/A")); > > query.add(termQuery, Occur.MUST_NOT); > > } > > TermQuery termQuery=new TermQuery(new Term(keywordField, "N/A")); > > query.add(termQuery, Occur.MUST); > > try { > > //Query q= new QueryParser(keywordField, new > > StandardAnalyzer()).parse(query.toString()); > > TopDocs results = searcher.search(query, > > maxNumDocs); > > ScoreDoc[] hits = results.scoreDocs; > > IndexWriter writer = getLuceneIndexWriter(searcher.getPath()); > > for (int i = 0; i < hits.length; i++) { > > Document document = searcher.doc(hits[i].doc); > > List keywords = keyword.getKeywords(hits[i].doc); > > if(keywords.size()>0) document.removeFields(keywordField); > > for (String word : keywords) { > > document.add(new StringField(keywordField, word, > > Field.Store.YES)); > > } > > String uniqueKey = > > searcher.getSchema().getUniqueKeyField().getName(); > > writer.updateDocument(new Term(uniqueKey, > > document.get(uniqueKey)), > > document); > > } > > writer.commit(); > > writer.forceMerge(1); > > writer.close(); > > } catch (IOException | SyntaxError e) { > > throw new RuntimeException(); > > } > > > > Please help me through solving this problem. > > > > -- > > A.Nazemian > -- A.Nazemian
Re: Lucene indexWriter update does not affect Solr search
I did some investigation and found out that the retrieving part of documents works fine while Solr did not restarted. But the searching part of documents did not work. After I restarted Solr it seems that the core corrupted and failed to start! Here is the corresponding log: org.apache.solr.common.SolrException: Error opening new searcher at org.apache.solr.core.SolrCore.(SolrCore.java:896) at org.apache.solr.core.SolrCore.(SolrCore.java:662) at org.apache.solr.core.CoreContainer.create(CoreContainer.java:513) at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:278) at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:272) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334) at java.util.concurrent.FutureTask.run(FutureTask.java:166) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) at java.lang.Thread.run(Thread.java:722) Caused by: org.apache.solr.common.SolrException: Error opening new searcher at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1604) at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1716) at org.apache.solr.core.SolrCore.(SolrCore.java:868) ... 9 more Caused by: org.apache.lucene.index.IndexNotFoundException: no segments* file found in NRTCachingDirectory(MMapDirectory@C:\Users\Ali\workspace\lucene_solr_5_0_0\solr\server\solr\document\data\index lockFactory=org.apache.lucene.store.SimpleFSLockFactory@3bf76891; maxCacheMB=48.0 maxMergeSizeMB=4.0): files: [_2_Lucene50_0.doc, write.lock, _2_Lucene50_0.pos, _2.nvd, _2.fdt, _2_Lucene50_0.tim] at org.apache.lucene.index.IndexWriter.(IndexWriter.java:821) at org.apache.solr.update.SolrIndexWriter.(SolrIndexWriter.java:78) at org.apache.solr.update.SolrIndexWriter.create(SolrIndexWriter.java:65) at org.apache.solr.update.DefaultSolrCoreState.createMainIndexWriter(DefaultSolrCoreState.java:272) at org.apache.solr.update.DefaultSolrCoreState.getIndexWriter(DefaultSolrCoreState.java:115) at org.apache.solr.core.SolrCore.openNewSearcher(SolrCore.java:1573) ... 11 more 4/7/2015, 6:53:26 PM ERROR SolrIndexWriter SolrIndexWriter was not closed prior to finalize(), indicates a bug -- POSSIBLE RESOURCE LEAK!!! 4/7/2015, 6:53:26 PM ERROR SolrIndexWriter Error closing IndexWriter java.lang.NullPointerException at org.apache.lucene.index.IndexWriter.doFlush(IndexWriter.java:2959) at org.apache.lucene.index.IndexWriter.flush(IndexWriter.java:2927) at org.apache.lucene.index.IndexWriter.shutdown(IndexWriter.java:965) at org.apache.lucene.index.IndexWriter.close(IndexWriter.java:1010) at org.apache.solr.update.SolrIndexWriter.close(SolrIndexWriter.java:130) at org.apache.solr.update.SolrIndexWriter.finalize(SolrIndexWriter.java:183) at java.lang.ref.Finalizer.invokeFinalizeMethod(Native Method) at java.lang.ref.Finalizer.runFinalizer(Finalizer.java:101) at java.lang.ref.Finalizer.access$100(Finalizer.java:32) at java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:190) There for my guess would be problem with indexing the keywordField and also problem related to closing the IndexWriter. On Tue, Apr 7, 2015 at 6:13 PM, Ali Nazemian wrote: > Dear Upayavira, > Hi, > It is just the part of my code in which caused the problem. I know > searchComponent is not for changing the index, but for the purpose of > extracting document keywords I was forced to hack searchComponent for > extracting keywords and putting them into index. 
> For more information about why I chose searchComponent at the first place > please follow this link: > > https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201503.mbox/browser > > Best regards. > > > On Tue, Apr 7, 2015 at 5:30 PM, Upayavira wrote: > >> What are you trying to do? A search component is not intended for >> updating the index, so it really doesn’t surprise me that you aren’t >> seeing updates. >> >> I’d suggest you describe the problem you are trying to solve before >> proposing solutions. >> >> Upayavira >> >> >> On Tue, Apr 7, 2015, at 01:32 PM, Ali Nazemian wrote: >> > I implement a small code for the purpose of extracting some keywords out >> > of >> > Lucene index. I did implement that using search component. My problem is >> > when I tried to update Lucene IndexWriter, Solr index which is placed on >> > top of that, does not affect. As you can see I did the commit part. >> > >> > BooleanQuery query = new BooleanQuery(); >> > for (String fieldName : keywordSourceFields) { >> > TermQuery termQuery = new TermQuery(new >> Term(fieldName,"N/A")); >> > query
Lucene updateDocument does not affect index until restarting solr
Dear all, Hi, As a part of my code I have to update Lucene documents. For this purpose I used the writer.updateDocument() method. My problem is that the update does not affect the index until Solr is restarted. Would you please tell me which part of my code is wrong, or what I should add in order to apply the changes?

RefCounted iw = solrCoreState.getIndexWriter(core);
try {
  IndexWriter writer = iw.get();
  FieldType type = new FieldType(StringField.TYPE_STORED);
  for (int i = 0; i < hits.length; i++) {
    Document document = searcher.doc(hits[i].doc);
    List keywords = keyword.getKeywords(hits[i].doc);
    if (keywords.size() > 0) document.removeFields(keywordField);
    for (String word : keywords) {
      document.add(new Field(keywordField, word, type));
    }
    String uniqueKey = searcher.getSchema().getUniqueKeyField().getName();
    writer.updateDocument(new Term(uniqueKey, document.get(uniqueKey)), document);
  }
  writer.commit();
} finally {
  iw.decref();
}

Best regards. -- A.Nazemian
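The symptom described here (changes only visible after a restart) is consistent with Lucene committing the data while Solr keeps serving its already-open searcher. One way around it, sketched under the assumption that this code runs inside a component that has access to the SolrCore and a SolrQueryRequest, is to issue the commit through Solr's UpdateHandler, which also reopens the searcher:

import java.io.IOException;

import org.apache.solr.core.SolrCore;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.update.CommitUpdateCommand;

public class SolrCommitHelper {

  // Commit through Solr instead of calling IndexWriter.commit() directly,
  // so that Solr opens a new searcher and the updates become visible.
  public static void commitAndReopenSearcher(SolrCore core, SolrQueryRequest req) throws IOException {
    CommitUpdateCommand cmd = new CommitUpdateCommand(req, false); // false = no expungeDeletes
    cmd.openSearcher = true;   // make the changes visible to queries
    cmd.waitSearcher = true;   // block until the new searcher is registered
    core.getUpdateHandler().commit(cmd);
  }
}

Whether updating documents from a search component is a good idea at all is a separate question, discussed further down this thread, but without reopening the searcher the changes will stay invisible until a restart.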
Problem related to filter on Zero value for DateField
Dears, Hi, I have a strange problem with Solr 4.10.x. My problem is that when I search on the Solr zero date, which is "0002-11-30T00:00:00Z", the results become invalid if more than one filter is applied. For example, consider this scenario: when I search with fq=p_date:"0002-11-30T00:00:00Z", Solr returns three different documents, which is correct for my collection. All three documents have the same value of "7" for document status. If I search with fq=document_status:7, the same three documents are returned, which is also a correct response. But when I search with fq=document_status:7&fq=p_date:"0002-11-30T00:00:00Z", Solr returns nothing (0 documents), while I have no such problem with date values other than the Solr zero date ("0002-11-30T00:00:00Z"). Please let me know whether this is a bug in Solr or I did something wrong. Best regards. -- A.Nazemian
Re: Problem related to filter on Zero value for DateField
Dear Jack, Hi, The q parameter is *:* since I just wanted to filter the documents. Regards. On Tue, Apr 14, 2015 at 8:07 PM, Jack Krupansky wrote: > What does your main query look like? Normally we don't speak of "searching" > with the fq parameter - it filters the results, but the actual searching is > done via the main query with the q parameter. > > -- Jack Krupansky > > On Tue, Apr 14, 2015 at 4:17 AM, Ali Nazemian > wrote: > > > Dears, > > Hi, > > I have strange problem with Solr 4.10.x. My problem is when I do > searching > > on solr Zero date which is "0002-11-30T00:00:00Z" if more than one filter > > be considered, the results became invalid. For example consider this > > scenario: > > When I search for a document with fq=p_date:"0002-11-30T00:00:00Z" Solr > > returns three different documents which is right for my Collection. All > of > > these three documents have same value of "7" for document status. Now If > I > > search for fq=document_status:7 the same three documents returns which is > > also a correct response. But When I do the searching on > > fq=focument_status:7&fq=p_date:"0002-11-30T00:00:00Z", Solr returns > > nothing! (0 document) While I have not such problem with other date > values > > beside Solr Zero ("0002-11-30T00:00:00Z"). Please let me know it is a bug > > related to Solr or I did something wrong? > > Best regards. > > > > -- > > A.Nazemian > > > -- A.Nazemian
Re: Lucene updateDocument does not affect index until restarting solr
Dear Chris, Hi, Thank you for your response. Actually I implemented a small code for the purpose of extracting article keywords out of Lucene index on commit, optimize or calling the specific query. I did implement that using search component. I know that the searchComponent is not for the purpose of updating index, but it was suggested in Solr mailing list at the first place and it seems it is the most possible solution according to Solr extension points. Anyway for more information about why I chose searchComponent at the first place please take a look at this <https://mail-archives.apache.org/mod_mbox/lucene-solr-user/201503.mbox/browser> link. Best regards. On Wed, Apr 15, 2015 at 10:00 PM, Chris Hostetter wrote: > > the short answer is that you need something to re-open the searcher -- but > i'm not going to go into specifics on how to do that because... > > You are dealing with a VERY low level layer of the lucene/solr code stack > -- w/o more details on why you've written this particular bit of code (and > where in the solr stack this code lives) it's hard to give you general > advice on the best way to proceed and i don't wnat to encourage you along > a dangerous path when there are likely much > easier/better/safer/more-supported ways to do what you are trying to do -- > you just need to explain to us what that is. > > https://people.apache.org/~hossman/#xyproblem > XY Problem > > Your question appears to be an "XY Problem" ... that is: you are dealing > with "X", you are assuming "Y" will help you, and you are asking about "Y" > without giving more details about the "X" so that we can understand the > full issue. Perhaps the best solution doesn't involve "Y" at all? > See Also: http://www.perlmonks.org/index.pl?node_id=542341 > > > > > : Date: Thu, 9 Apr 2015 01:02:16 +0430 > : From: Ali Nazemian > : Reply-To: solr-user@lucene.apache.org > : To: "solr-user@lucene.apache.org" > : Subject: Lucene updateDocument does not affect index until restarting > solr > : > : Dear all, > : Hi, > : As a part of my code I have to update Lucene document. For this purpose I > : used writer.updateDocument() method. My problem is the update process is > : not affect index until restarting Solr. Would you please tell me what > part > : of my code is wrong? Or what should I add in order to apply the changes? > : > : RefCounted iw = solrCoreState.getIndexWriter(core); > : try { > : IndexWriter writer = iw.get(); > : FieldType type= new FieldType(StringField.TYPE_STORED); > : for (int i = 0; i < hits.length; i++) { > : Document document = searcher.doc(hits[i].doc); > : List keywords = keyword.getKeywords(hits[i].doc); > : if (keywords.size() > 0) document.removeFields(keywordField); > : for (String word : keywords) { > : document.add(new Field(keywordField, word, type)); > : } > : String uniqueKey = > : searcher.getSchema().getUniqueKeyField().getName(); > : writer.updateDocument(new Term(uniqueKey, > : document.get(uniqueKey)), > : document); > : } > : writer.commit(); > : } finally { > : iw.decref(); > : } > : > : > : Best regards. > : > : -- > : A.Nazemian > : > > -Hoss > http://www.lucidworks.com/ > -- A.Nazemian
Date Format Conversion Function Query
Dear all, Hi, I was wondering whether there is any function query for converting date formats in Solr. If not, how can I implement such a function query myself? -- A.Nazemian
Re: Date Format Conversion Function Query
Dear Erick, Hi, Actually I want to convert dates from the Gregorian calendar (the Solr default) to the Persian calendar. You may ask why I do not do that on the client side? Here is why: I want to provide a way to extract data from Solr in CSV format. I know that Solr has a CSV ResponseWriter that could be used in this case. But my problem is that the dates in the Solr index are in the Gregorian calendar and I want to output them in the Persian calendar. Therefore I was thinking of a function query to do that at query time for me. Regards. On Tue, Jun 9, 2015 at 10:55 PM, Erick Erickson wrote: > I'm not sure what you're asking for, give us an example input/output pair? > > Best, > Erick > > On Tue, Jun 9, 2015 at 8:47 AM, Ali Nazemian > wrote: > > Dear all, > > Hi, > > I was wondering is there any function query for converting date format in > > Solr? If no, how can I implement such function query myself? > > > > -- > > A.Nazemian > -- A.Nazemian
Re: Date Format Conversion Function Query
Thank you very much. It seems that document transformer is the perfect extension point for this conversion. I will try to implement that. Best regards. On Wed, Jun 10, 2015 at 3:54 PM, Upayavira wrote: > Another technology that might make more sense is a Doc Transformer. > > You also specify them in the fl parameter. I would imagine you could > specify > > fl=id,[persian f=gregorian_Date] > > See here for more cases: > > > https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents > > This does not exist right now, but would make a good contribution to > Solr itself, I'd say. > > Upayavira > > On Wed, Jun 10, 2015, at 09:57 AM, Alessandro Benedetti wrote: > > Erick will correct me if I am wrong but this function query I don't think > > it exists. > > But maybe can be a nice contribution. > > It should take in input a date format and a field and give in response > > the > > new formatted Date. > > > > The would be simple to use it : > > > > fl=id,persian_date:dateFormat("/mm/dd",gregorian_Date) > > > > The date format is an example in input is an example. > > > > Cheers > > > > 2015-06-10 7:24 GMT+01:00 Ali Nazemian : > > > > > Dear Erick, > > > Hi, > > > Actually I want to convert date format from Geregorian calendar (solr > > > default) to Perisan calendar. You may ask why i do not do that at > client > > > side? Here is why: > > > > > > I want to provide a way to extract data from solr in the csv format. I > know > > > that solr has csv ResponseWriter that could be used in this case. But > my > > > problem is that the date format in solr index is provided by Geregorian > > > calendar and I want to put that in Persian calendar. Therefore I was > > > thinking of a function query to do that at query time for me. > > > > > > Regards. > > > > > > On Tue, Jun 9, 2015 at 10:55 PM, Erick Erickson < > erickerick...@gmail.com> > > > wrote: > > > > > > > I'm not sure what you're asking for, give us an example input/output > > > pair? > > > > > > > > Best, > > > > Erick > > > > > > > > On Tue, Jun 9, 2015 at 8:47 AM, Ali Nazemian > > > > wrote: > > > > > Dear all, > > > > > Hi, > > > > > I was wondering is there any function query for converting date > format > > > in > > > > > Solr? If no, how can I implement such function query myself? > > > > > > > > > > -- > > > > > A.Nazemian > > > > > > > > > > > > > > > > -- > > > A.Nazemian > > > > > > > > > > > -- > > -- > > > > Benedetti Alessandro > > Visiting card : http://about.me/alessandro_benedetti > > > > "Tyger, tyger burning bright > > In the forests of the night, > > What immortal hand or eye > > Could frame thy fearful symmetry?" > > > > William Blake - Songs of Experience -1794 England > -- A.Nazemian
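The DocTransformer route Upayavira describes would roughly take the following shape. This is only a skeleton: the class and parameter names are made up, the exact transform() signature differs between Solr versions, and the actual Gregorian-to-Persian conversion is left as a placeholder (e.g. via ICU4J or a hand-written algorithm):

import java.util.Date;

import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.transform.DocTransformer;
import org.apache.solr.response.transform.TransformerFactory;

// Would be registered in solrconfig.xml and used as fl=id,[persian f=gregorian_date]
public class PersianDateTransformerFactory extends TransformerFactory {

  @Override
  public DocTransformer create(String field, SolrParams params, SolrQueryRequest req) {
    final String sourceField = params.get("f", "gregorian_date");
    return new DocTransformer() {
      @Override
      public String getName() {
        return field;
      }

      @Override
      public void transform(SolrDocument doc, int docid) {
        Object value = doc.getFieldValue(sourceField);
        if (value instanceof Date) {
          doc.setField(field, toPersian((Date) value));
        }
      }
    };
  }

  // Placeholder: convert a Gregorian Date to a Persian (Jalali) calendar string.
  private static String toPersian(Date gregorian) {
    return gregorian.toInstant().toString();
  }
}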
Extracting article keywords using tf-idf algorithm
Dear Lucene/Solr developers, Hi, I decided to develop a plugin for Solr in order to extract the main keywords from an article. Since Solr has already done the hard work of calculating tf-idf scores, I decided to use that for the sake of better performance. I know that an UpdateRequestProcessor is the best-suited extension point for adding a keyword value to documents. I also found out that I do not have access to tf-idf scores inside the UpdateRequestProcessor, because the UpdateRequestProcessor chain is applied before tf-idf scores are calculated. Hence, after consulting with Solr/Lucene developers, I decided to go for a searchComponent in order to calculate keywords based on tf-idf (Lucene interesting terms) on commit/optimize. Unfortunately, with this approach I observed strange core behavior: for example, sometimes faceting won't work on this keyword field, or the index becomes unstable in search results. I would really appreciate it if someone could help me make it stable.

NamedList response = new SimpleOrderedMap();
keyword.init(searcher, params);
BooleanQuery query = new BooleanQuery();
for (String fieldName : keywordSourceFields) {
  TermQuery termQuery = new TermQuery(new Term(fieldName, "noval"));
  query.add(termQuery, Occur.MUST_NOT);
}
TermQuery termQuery = new TermQuery(new Term(keywordField, "noval"));
query.add(termQuery, Occur.MUST);
RefCounted iw = null;
IndexWriter writer = null;
try {
  TopDocs results = searcher.search(query, maxNumDocs);
  ScoreDoc[] hits = results.scoreDocs;
  iw = solrCoreState.getIndexWriter(core);
  writer = iw.get();
  FieldType type = new FieldType(StringField.TYPE_STORED);
  for (int i = 0; i < hits.length; i++) {
    Document document = searcher.doc(hits[i].doc);
    List keywords = keyword.getKeywords(hits[i].doc);
    if (keywords.size() > 0) document.removeFields(keywordField);
    for (String word : keywords) {
      document.add(new Field(keywordField, word, type));
    }
    String uniqueKey = searcher.getSchema().getUniqueKeyField().getName();
    writer.updateDocument(new Term(uniqueKey, document.get(uniqueKey)), document);
  }
  response.add("Number of Selected Docs", results.totalHits);
  writer.commit();
} catch (IOException | SyntaxError e) {
  throw new RuntimeException();
} finally {
  if (iw != null) {
    iw.decref();
  }
}

public List getKeywords(int docId) throws SyntaxError {
  String[] fields = new String[keywordSourceFields.size()];
  List terms = new ArrayList();
  fields = keywordSourceFields.toArray(fields);
  mlt.setFieldNames(fields);
  mlt.setAnalyzer(indexSearcher.getSchema().getIndexAnalyzer());
  mlt.setMinTermFreq(minTermFreq);
  mlt.setMinDocFreq(minDocFreq);
  mlt.setMinWordLen(minWordLen);
  mlt.setMaxQueryTerms(maxNumKeywords);
  mlt.setMaxNumTokensParsed(maxTokensParsed);
  try {
    terms = Arrays.asList(mlt.retrieveInterestingTerms(docId));
  } catch (IOException e) {
    LOGGER.error(e.getMessage());
    throw new RuntimeException();
  }
  return terms;
}

Best regards. -- A.Nazemian
Solr group query based on the sum aggregation of function query
Dear Solr users/developers, Hi, I have tried to implement the Page and Post relation in a single Solr schema. In my use case each page has multiple posts. The Page and Post fields are as follows: Post:{post_content, owner_page_id, document_type} Page:{page_id, document_type} Suppose I want to query this single core for results sorted by the total term frequency of a specific term per page. At first I thought the following query could satisfy this requirement for the term "hello": http://localhost:8983/solr/document/select?wt=json&indent=true&fl=id,name&q=*:*&group=true&group.field=owner_page_id&sort=termfreq(post_content,%27hello%27)+desc&fl=result:termfreq(post_content_text,%27hello%27),owner_page_id But it seems that this query returns the term frequency for a single post of each page; the result is not aggregated over all of the page's posts, and I am looking for the aggregated result. I would be really grateful if somebody could help me find the required query. P.S.: I am using Solr 6, so the JSON Facet API is available to me. -- A.Nazemian
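Since the JSON Facet API is available here, one way to get a per-page aggregate is a terms facet on owner_page_id with a sum(termfreq(...)) sub-aggregation, sorted descending by that sum. A rough SolrJ 6.x sketch, where the collection name, field names and limit are assumptions and sum() over termfreq() should be verified on your release:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class PerPageTermFreq {
  public static void main(String[] args) throws Exception {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/document");
    SolrQuery q = new SolrQuery("*:*");
    q.setRows(0);   // only the facet buckets are needed
    // One bucket per page, each carrying the summed term frequency of 'hello'
    // over all posts owned by that page, sorted by that sum.
    q.set("json.facet",
        "{pages:{type:terms, field:owner_page_id, limit:20, sort:\"total desc\","
      + " facet:{total:\"sum(termfreq(post_content,'hello'))\"}}}");
    QueryResponse rsp = client.query(q);
    System.out.println(rsp.getResponse().get("facets"));
    client.close();
  }
}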
Using SolrCloud with RDBMS or without
Hi everybody, I was wondering which scenario (or combination of scenarios) would be better for my application from the perspective of performance, scalability and high availability. Here is my application: suppose I am going to have more than 10m documents and the count grows every day (probably reaching more than 100m docs within a year). I want to use Solr as the tool for indexing these documents, but the problem is that I have some data fields that could change frequently (not too often, but they can change). Scenarios: 1- Using SolrCloud as the database for all data (even the data that could change). 2- Using SolrCloud as the database for static data and using an RDBMS (such as Oracle) for storing dynamic fields. 3- Using the integration of SolrCloud and Hadoop (HDFS+MapReduce) for all data. Best regards. -- A.Nazemian
Re: Using SolrCloud with RDBMS or without
The fact that I ignore Cassandra is because of it seems Cassandra is perfect when you have too much write operation. In my case it is true that I have some update operation but for sure read operations are much more than write ones. By the way there are probably more scenarios for my application. My question would be which one is probably the best? Best regards. On Mon, May 26, 2014 at 6:27 PM, Jack Krupansky wrote: > You could also consider DataStax Enterprise, which integrates Apache > Cassandra as the primary database and Solr for indexing and query. > > See: > http://www.datastax.com/what-we-offer/products-services/ > datastax-enterprise > > -- Jack Krupansky > > -----Original Message- From: Ali Nazemian > Sent: Monday, May 26, 2014 9:50 AM > To: solr-user@lucene.apache.org > Subject: Using SolrCloud with RDBMS or without > > > Hi everybody, > > I was wondering which scenario (or the combination) would be better for my > application. From the aspect of performance, scalability and high > availability. Here is my application: > > Suppose I am going to have more than 10m documents and it grows every day. > (probably in 1 years it reaches to more than 100m docs. I want to use Solr > as tool for indexing these documents but the problem is I have some data > fields that could change frequently. (not too much but it could change) > > Scenarios: > > 1- Using SolrCloud as database for all data. (even the one that could be > changed) > > 2- Using SolrCloud as database for static data and using RDBMS (such as > oracle) for storing dynamic fields. > > 3- Using The integration of SolrCloud and Hadoop (HDFS+MapReduce) for all > data. > > Best regards. > > -- > A.Nazemian > -- A.Nazemian
Re: Using SolrCloud with RDBMS or without
Dear Erick, Thank you for you reply. Some parts of documents come from Nutch crawler and the other parts come from processing those documents. I really need it to be as fast as possible and 10 hours for indexing is not acceptable for my application. Regards. On Mon, May 26, 2014 at 9:25 PM, Erick Erickson wrote: > What you haven't told us is where the data comes from. But until > you put some numbers to it, it's hard to decide. > > I tend to prefer storing the data somewhere else, filesystem, whatever > and indexing to Solr when data changes. Even if that means re-indexing > the entire corpus. I don't like going to more complicated solutions until > that proves untenable. > > Backup/restore solutions for filesystems, DBs, whatever are are a very > mature technology, I rely on that first to store my original source. > > Now you can re-index at will. > > So let's claim your data comes in from some stream somewhere. I'd > 1> store it to the file system. > 2> write a program to pull it off the file system and index. > 3> Your comment about MapReduceIndexerTool is germane. You can re-index > all that data very quickly. And it'll find files on your file system > for you too! > > But I wouldn't even go there until I'd tried > indexing my 10M docs straight with SolrJ or similar. If you can index > your 10M docs > in 1 hour and, by extrapolation your 100M docs in 10 hours, is that good > enough? > I don't know, it's your problem space after all ;). And is it acceptable > to not > see changes to the schema until tomorrow morning? If so, there's no need > to get > more complicated > > Best, > Erick > > On Mon, May 26, 2014 at 9:00 AM, Shawn Heisey wrote: > > On 5/26/2014 7:50 AM, Ali Nazemian wrote: > >> I was wondering which scenario (or the combination) would be better for > my > >> application. From the aspect of performance, scalability and high > >> availability. Here is my application: > >> > >> Suppose I am going to have more than 10m documents and it grows every > day. > >> (probably in 1 years it reaches to more than 100m docs. I want to use > Solr > >> as tool for indexing these documents but the problem is I have some data > >> fields that could change frequently. (not too much but it could change) > > > > Choosing which database software to use to hold your data is a problem > > with many possible solutions. Everyone will have a different answer for > > you. Each solution has strengths and weaknesses, and in the end, only > > you can really know what your requirements are. > > > >> Scenarios: > >> > >> 1- Using SolrCloud as database for all data. (even the one that could be > >> changed) > > > > If you choose to use Solr as a NoSQL, I would strongly recommend that > > you have two Solr installs. The first install would be purely for data > > storage and would have no indexed fields. If you can get machines with > > enough RAM, it would also probably be preferable to use a single index > > (or SolrCloud with one shard) for that install. The other install would > > be for searching. Sharding would not be an issue on that index. The > > reason that I make this recommendation is that when you use Solr for > > searching, you have to do a complete reindex if you change your search > > schema. It's difficult to reindex if the search index is also your > > canonical data source. > > > >> 2- Using SolrCloud as database for static data and using RDBMS (such as > >> oracle) for storing dynamic fields. > > > > I don't think it would be a good idea to have two canonical data > > sources. Pick one. 
As already mentioned, Solr is better as a search > > technology, serving up pointers to data in another data source, than as > > a database. > > > > If you want to use RDBMS technology, why would you spend all that money > > on Oracle? Just use one of the free databases. Our really large Solr > > index comes from a database. At one time that database was in Oracle. > > When my employer purchased the company with that database, we thought we > > were obtaining a full Oracle license. It turns out we weren't. It > > would have cost about half a million dollars to buy that license, so we > > switched to MySQL. > > > > Since making that move to MySQL, performance is actually *better*. The > > source table for our data has 96 million rows right now, growing at a > > rate of a few million per year. This is completely in line with your > > 100 million document requirement. F
Re: Using SolrCloud with RDBMS or without
Dear Shawn, Hi and thank you for you reply. Could you please tell me about the performance and scalability of the mentioned solutions? Suppose I have a SolrCloud with 4 different machine. Would it scale linearly if I add another 4 machines to that? I mean when the documents number increases from 10m to 100m documents. Regards. On Mon, May 26, 2014 at 8:30 PM, Shawn Heisey wrote: > On 5/26/2014 7:50 AM, Ali Nazemian wrote: > > I was wondering which scenario (or the combination) would be better for > my > > application. From the aspect of performance, scalability and high > > availability. Here is my application: > > > > Suppose I am going to have more than 10m documents and it grows every > day. > > (probably in 1 years it reaches to more than 100m docs. I want to use > Solr > > as tool for indexing these documents but the problem is I have some data > > fields that could change frequently. (not too much but it could change) > > Choosing which database software to use to hold your data is a problem > with many possible solutions. Everyone will have a different answer for > you. Each solution has strengths and weaknesses, and in the end, only > you can really know what your requirements are. > > > Scenarios: > > > > 1- Using SolrCloud as database for all data. (even the one that could be > > changed) > > If you choose to use Solr as a NoSQL, I would strongly recommend that > you have two Solr installs. The first install would be purely for data > storage and would have no indexed fields. If you can get machines with > enough RAM, it would also probably be preferable to use a single index > (or SolrCloud with one shard) for that install. The other install would > be for searching. Sharding would not be an issue on that index. The > reason that I make this recommendation is that when you use Solr for > searching, you have to do a complete reindex if you change your search > schema. It's difficult to reindex if the search index is also your > canonical data source. > > > 2- Using SolrCloud as database for static data and using RDBMS (such as > > oracle) for storing dynamic fields. > > I don't think it would be a good idea to have two canonical data > sources. Pick one. As already mentioned, Solr is better as a search > technology, serving up pointers to data in another data source, than as > a database. > > If you want to use RDBMS technology, why would you spend all that money > on Oracle? Just use one of the free databases. Our really large Solr > index comes from a database. At one time that database was in Oracle. > When my employer purchased the company with that database, we thought we > were obtaining a full Oracle license. It turns out we weren't. It > would have cost about half a million dollars to buy that license, so we > switched to MySQL. > > Since making that move to MySQL, performance is actually *better*. The > source table for our data has 96 million rows right now, growing at a > rate of a few million per year. This is completely in line with your > 100 million document requirement. For the massive table that feeds > Solr, we might switch to MongoDB, but that has not been decided yet. > > Later we switched from EasyAsk to Solr, a move that has *also* given us > better performance. Because both MySQL and Solr are free, we've > achieved a substantial cost savings. > > > 3- Using The integration of SolrCloud and Hadoop (HDFS+MapReduce) for all > > data. 
> > I have no experience with this technology, but I think that if you are > thinking about a database on HDFS, you're probably actually talking > about HBase, the Apache implementation of Google's BigTable. > > Thanks, > Shawn > > -- A.Nazemian
solr cross doc join on relational database
Hi everybody, I was wondering whether there is any way to use a cross-doc join across the integration of one Solr core and a relational database. Suppose I have a table named USER in a relational database (MySQL). I want to keep track of the news items each user can access. Assume the news is stored inside Solr and there is no easy way of transferring the USER table to Solr (because of the many changes that would be required in other parts of my application). So my question would be: is there any way of having a cross-doc join with one document inside Solr and the other one inside the RDBMS? Best regards. -- A.Nazemian
Re: solr cross doc join on relational database
Thank you very much. I will take a look at that. On Fri, May 30, 2014 at 4:24 PM, Ahmet Arslan wrote: > Hi Ali, > > I did a similar user filtering by indexing user table once per hour, and > filtering results by solr query time join query parser. > > Assuming there is no easy way to transfer USER table to solr, Solr post > filtering is the way to : > > http://searchhub.org/2012/02/22/custom-security-filtering-in-solr/ > > You can connect to your database in it, filter according to rights. ( can > this user see this document?) > > /** > * Note that this Query implementation can _only_ be used as an fq, not as > a q (it would need to implement createWeight). > */ > public class AreaIsOpenControlQuery extends ExtendedQueryBase implements > PostFilter { > > > > On Friday, May 30, 2014 2:26 PM, Ali Nazemian > wrote: > > > > Hi every body, > I was wondering is there any way for using cross doc join on integraion of > one solr core and a relational database. > Suppose I have a table in relational database (my sql) name USER. I want to > keep track of news that each user can have access. Assume news are stored > inside solr and there is no easy way of transferring USER table to solr > (because of so many changes that should be done inside other part of my > application) So my question would be is there any way of having cross doc > join with one document inside Solr and another one inside RDBMS? > Best regards. > -- > A.Nazemian > -- A.Nazemian
Document security filtering in distributed solr (with multi shard)
Dears, Hi, I am going to apply custom security filtering to each document per user (using a custom profile for each user). I was thinking of adding user fields to the index and using a Solr join for filtering, but it seems that for distributed Solr this is not a solution. Could you please tell me what the solution would be in this case? Best regards. -- A.Nazemian
Re: Document security filtering in distributed solr (with multi shard)
Dear Alexandre, Yeah, I saw that, but what is the best way of doing it from the performance point of view? I thought of one solution myself: suppose we have an RDBMS for users that contains the category and group of each user (it could be hierarchical). Suppose there is a field named "security" in the Solr index that contains the list of groups or categories allowed to see each document. The query would then filter only the documents whose category or group matches the ones for that user. Does this solution work in a distributed setup? What about performance? Also, I was wondering how Lucidworks does that? Best regards. On Tue, Jun 17, 2014 at 4:08 PM, Alexandre Rafalovitch wrote: > Have you looked at Post Filters? I think this was one of the use cases. > > An old article: > http://java.dzone.com/articles/custom-security-filtering-solr . Google > search should bring a couple more. > > Regards, >Alex. > Personal website: http://www.outerthoughts.com/ > Current project: http://www.solr-start.com/ - Accelerating your Solr > proficiency > > > On Tue, Jun 17, 2014 at 6:24 PM, Ali Nazemian > wrote: > > Dears, > > Hi, > > I am going to apply customer security filtering for each document per > each > > user. (using custom profile for each user). I was thinking of adding user > > fields to index and using solr join for filtering. But It seems for > > distributed solr this is not a solution. Could you please tell me what > the > > solution would be in this case? > > Best regards. > > > > -- > > A.Nazemian -- A.Nazemian
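For reference, the ACL-field idea sketched above boils down to attaching a filter query built from the user's groups to every request; because it is a plain fq against an indexed field, it works unchanged in distributed/SolrCloud setups and is cached like any other filter. A minimal SolrJ sketch, where the field name, group values and client setup are assumptions (in SolrJ 4.x the client class would be HttpSolrServer instead):

import java.util.Arrays;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class AclFilteredSearch {

  // Build a filter such as security:("group_a" OR "group_b") from the groups
  // loaded for the current user (e.g. from the user RDBMS). Group values are
  // assumed to be simple tokens that need no extra escaping.
  static String buildAclFilter(List<String> userGroups) {
    StringBuilder fq = new StringBuilder("security:(");
    for (int i = 0; i < userGroups.size(); i++) {
      if (i > 0) fq.append(" OR ");
      fq.append('"').append(userGroups.get(i)).append('"');
    }
    return fq.append(')').toString();
  }

  public static void main(String[] args) throws Exception {
    HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/news");
    SolrQuery q = new SolrQuery("some user query");
    q.addFilterQuery(buildAclFilter(Arrays.asList("group_a", "group_b")));
    QueryResponse rsp = client.query(q);
    System.out.println(rsp.getResults().getNumFound());
    client.close();
  }
}

The post-filter approach from the article Alexandre links remains the alternative when the ACL logic is too dynamic to denormalize into an indexed field.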
Re: Document security filtering in distributed solr (with multi shard)
Any idea would be appropriate. On Tue, Jun 17, 2014 at 5:44 PM, Ali Nazemian wrote: > Dear Alexandre, > Yeah I saw that, but what is the best way of doing that from the > performance point of view? > I think of one solution myself: > Suppose we have a RDBMS for users that contains the category and group for > each user. (It could be in hierarchical format) Suppose there is a field > name "security" in solr index that contains the list of each group or > category that is applied to each document. So the query would be filter > only documents that its category or group match the specific one for that > user. > Is this solution works in distributed way? What if we concern about > performance? > Also I was wondering how lucidworks do that? > Best regards. > > > On Tue, Jun 17, 2014 at 4:08 PM, Alexandre Rafalovitch > wrote: > >> Have you looked at Post Filters? I think this was one of the use cases. >> >> An old article: >> http://java.dzone.com/articles/custom-security-filtering-solr . Google >> search should bring a couple more. >> >> Regards, >>Alex. >> Personal website: http://www.outerthoughts.com/ >> Current project: http://www.solr-start.com/ - Accelerating your Solr >> proficiency >> >> >> On Tue, Jun 17, 2014 at 6:24 PM, Ali Nazemian >> wrote: >> > Dears, >> > Hi, >> > I am going to apply customer security filtering for each document per >> each >> > user. (using custom profile for each user). I was thinking of adding >> user >> > fields to index and using solr join for filtering. But It seems for >> > distributed solr this is not a solution. Could you please tell me what >> the >> > solution would be in this case? >> > Best regards. >> > >> > -- >> > A.Nazemian >> > > > > -- > A.Nazemian > -- A.Nazemian
solr dedup on specific fields
Hi, I use Solr 4.8 for indexing the web pages that come from Nutch. I know that Solr's deduplication works on the uniqueKey field, so I set that to the URL field. Everything is OK, except that after duplicate detection I want Solr not to overwrite all fields of the old document; I want some fields to remain unchanged. For example, assume I have a field called "read" with the Boolean value "true" for a specific document. I want all fields of the new document to overwrite the old ones except the value of this field. Is that possible? How? Regards. -- A.Nazemian
Re: solr dedup on specific fields
Any suggestion would be appreciated. Regards. On Mon, Jun 30, 2014 at 2:49 PM, Ali Nazemian wrote: > Hi, > I used solr 4.8 for indexing the web pages that come from nutch. I know > that solr deduplication operation works on uniquekey field. So I set that > to URL field. Everything is OK. except that I want after duplication > detection solr try not to delete all fields of old document. I want some > fields remain unchanged. For example assume I have a data field called > "read" with Boolean value "true" for specific document. I want all fields > of new document overwrites except the value of this field. Is that > possible? How? > Regards. > > -- > A.Nazemian > -- A.Nazemian
Re: solr dedup on specific fields
Dears, Is there any way that I can do that in other way? I mean if you look at my main problem again you will find out that I have two types of fields in my documents. 1) The ones that should be overwritten on duplicates, 2) The ones that should not change during duplicates. So Is it another way to handle this situation from the first place? I mean using cross join for example? Assume I have a document with ID 2 which contains all the fields that can be overwritten. And another document with ID 2 which contains all fields that should not change during duplication detection. For selecting all fields it is enough to do join on ID and for Duplication it is enough to overwrite just document type 1. Regards. On Tue, Jul 1, 2014 at 6:17 PM, Alexandre Rafalovitch wrote: > Well, it's implemented in SignatureUpdateProcessorFactory. Worst case, > you can clone that code and add your preserve-field functionality. > Could even be a nice contribution. > > Regards, >Alex. > > Personal website: http://www.outerthoughts.com/ > Current project: http://www.solr-start.com/ - Accelerating your Solr > proficiency > > > On Tue, Jul 1, 2014 at 6:50 PM, Ali Nazemian > wrote: > > Any suggestion would be appreciated. > > Regards. > > > > > > On Mon, Jun 30, 2014 at 2:49 PM, Ali Nazemian > wrote: > > > >> Hi, > >> I used solr 4.8 for indexing the web pages that come from nutch. I know > >> that solr deduplication operation works on uniquekey field. So I set > that > >> to URL field. Everything is OK. except that I want after duplication > >> detection solr try not to delete all fields of old document. I want some > >> fields remain unchanged. For example assume I have a data field called > >> "read" with Boolean value "true" for specific document. I want all > fields > >> of new document overwrites except the value of this field. Is that > >> possible? How? > >> Regards. > >> > >> -- > >> A.Nazemian > >> > > > > > > > > -- > > A.Nazemian > -- A.Nazemian
Re: solr dedup on specific fields
Updating documents will add some extra time to indexing process. (I send the documents via apache Nutch) I prefer to make indexing as fast as possible. On Mon, Jul 7, 2014 at 12:05 PM, Alexandre Rafalovitch wrote: > Can you use Update operation instead of Create? Then, you can supply > only the fields that need to be changed and use atomic update to > preserve the others. But then you will have issues when you _are_ > creating new documents and you do need to store all fields. > > Regards, >Alex. > Personal website: http://www.outerthoughts.com/ > Current project: http://www.solr-start.com/ - Accelerating your Solr > proficiency > > > On Mon, Jul 7, 2014 at 2:08 PM, Ali Nazemian > wrote: > > Dears, > > Is there any way that I can do that in other way? > > I mean if you look at my main problem again you will find out that I have > > two types of fields in my documents. 1) The ones that should be > overwritten > > on duplicates, 2) The ones that should not change during duplicates. So > Is > > it another way to handle this situation from the first place? I mean > using > > cross join for example? > > Assume I have a document with ID 2 which contains all the fields that can > > be overwritten. And another document with ID 2 which contains all fields > > that should not change during duplication detection. For selecting all > > fields it is enough to do join on ID and for Duplication it is enough to > > overwrite just document type 1. > > Regards. > > > > > > On Tue, Jul 1, 2014 at 6:17 PM, Alexandre Rafalovitch < > arafa...@gmail.com> > > wrote: > > > >> Well, it's implemented in SignatureUpdateProcessorFactory. Worst case, > >> you can clone that code and add your preserve-field functionality. > >> Could even be a nice contribution. > >> > >> Regards, > >>Alex. > >> > >> Personal website: http://www.outerthoughts.com/ > >> Current project: http://www.solr-start.com/ - Accelerating your Solr > >> proficiency > >> > >> > >> On Tue, Jul 1, 2014 at 6:50 PM, Ali Nazemian > >> wrote: > >> > Any suggestion would be appreciated. > >> > Regards. > >> > > >> > > >> > On Mon, Jun 30, 2014 at 2:49 PM, Ali Nazemian > >> wrote: > >> > > >> >> Hi, > >> >> I used solr 4.8 for indexing the web pages that come from nutch. I > know > >> >> that solr deduplication operation works on uniquekey field. So I set > >> that > >> >> to URL field. Everything is OK. except that I want after duplication > >> >> detection solr try not to delete all fields of old document. I want > some > >> >> fields remain unchanged. For example assume I have a data field > called > >> >> "read" with Boolean value "true" for specific document. I want all > >> fields > >> >> of new document overwrites except the value of this field. Is that > >> >> possible? How? > >> >> Regards. > >> >> > >> >> -- > >> >> A.Nazemian > >> >> > >> > > >> > > >> > > >> > -- > >> > A.Nazemian > >> > > > > > > > > -- > > A.Nazemian > -- A.Nazemian
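For reference, a small sketch of the atomic-update idea suggested above, assuming a recent SolrJ client and illustrative field names (a Boolean read flag that must survive re-crawls). Only the fields that may change are sent with a "set" operation, so any field not mentioned keeps its old value; note that atomic updates require the affected fields to be stored.

```java
import java.util.Collections;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateExample {
    public static void main(String[] args) throws Exception {
        // Assumed core name; adjust to your setup.
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/webpages").build();

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "http://example.com/page-1");            // uniqueKey (the URL)
        // "set" replaces only these fields; the existing "read" flag is left untouched.
        doc.addField("content", Collections.singletonMap("set", "re-crawled page text"));
        doc.addField("fetch_date", Collections.singletonMap("set", "2014-07-07T00:00:00Z"));

        solr.add(doc);
        solr.commit();
        solr.close();
    }
}
```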
Re: solr dedup on specific fields
Dear Alexande, What if I use ExternalFileFiled for the fields that I dont want to be changed? Does that work for me? Regards. On Mon, Jul 7, 2014 at 2:05 PM, Alexandre Rafalovitch wrote: > Well, let us know when you figure out a way to satisfy all your > requirements. > > Solr is designed for a full-document replace to be efficient at it's > primary function (search). Any workaround require some sort of > sacrifice. > > Good luck, >Alex. > Personal website: http://www.outerthoughts.com/ > Current project: http://www.solr-start.com/ - Accelerating your Solr > proficiency > > > On Mon, Jul 7, 2014 at 4:32 PM, Ali Nazemian > wrote: > > Updating documents will add some extra time to indexing process. (I send > > the documents via apache Nutch) I prefer to make indexing as fast as > > possible. > > > > > > On Mon, Jul 7, 2014 at 12:05 PM, Alexandre Rafalovitch < > arafa...@gmail.com> > > wrote: > > > >> Can you use Update operation instead of Create? Then, you can supply > >> only the fields that need to be changed and use atomic update to > >> preserve the others. But then you will have issues when you _are_ > >> creating new documents and you do need to store all fields. > >> > >> Regards, > >>Alex. > >> Personal website: http://www.outerthoughts.com/ > >> Current project: http://www.solr-start.com/ - Accelerating your Solr > >> proficiency > >> > >> > >> On Mon, Jul 7, 2014 at 2:08 PM, Ali Nazemian > >> wrote: > >> > Dears, > >> > Is there any way that I can do that in other way? > >> > I mean if you look at my main problem again you will find out that I > have > >> > two types of fields in my documents. 1) The ones that should be > >> overwritten > >> > on duplicates, 2) The ones that should not change during duplicates. > So > >> Is > >> > it another way to handle this situation from the first place? I mean > >> using > >> > cross join for example? > >> > Assume I have a document with ID 2 which contains all the fields that > can > >> > be overwritten. And another document with ID 2 which contains all > fields > >> > that should not change during duplication detection. For selecting all > >> > fields it is enough to do join on ID and for Duplication it is enough > to > >> > overwrite just document type 1. > >> > Regards. > >> > > >> > > >> > On Tue, Jul 1, 2014 at 6:17 PM, Alexandre Rafalovitch < > >> arafa...@gmail.com> > >> > wrote: > >> > > >> >> Well, it's implemented in SignatureUpdateProcessorFactory. Worst > case, > >> >> you can clone that code and add your preserve-field functionality. > >> >> Could even be a nice contribution. > >> >> > >> >> Regards, > >> >>Alex. > >> >> > >> >> Personal website: http://www.outerthoughts.com/ > >> >> Current project: http://www.solr-start.com/ - Accelerating your Solr > >> >> proficiency > >> >> > >> >> > >> >> On Tue, Jul 1, 2014 at 6:50 PM, Ali Nazemian > >> >> wrote: > >> >> > Any suggestion would be appreciated. > >> >> > Regards. > >> >> > > >> >> > > >> >> > On Mon, Jun 30, 2014 at 2:49 PM, Ali Nazemian < > alinazem...@gmail.com> > >> >> wrote: > >> >> > > >> >> >> Hi, > >> >> >> I used solr 4.8 for indexing the web pages that come from nutch. I > >> know > >> >> >> that solr deduplication operation works on uniquekey field. So I > set > >> >> that > >> >> >> to URL field. Everything is OK. except that I want after > duplication > >> >> >> detection solr try not to delete all fields of old document. I > want > >> some > >> >> >> fields remain unchanged. 
For example assume I have a data field > >> called > >> >> >> "read" with Boolean value "true" for specific document. I want all > >> >> fields > >> >> >> of new document overwrites except the value of this field. Is that > >> >> >> possible? How? > >> >> >> Regards. > >> >> >> > >> >> >> -- > >> >> >> A.Nazemian > >> >> >> > >> >> > > >> >> > > >> >> > > >> >> > -- > >> >> > A.Nazemian > >> >> > >> > > >> > > >> > > >> > -- > >> > A.Nazemian > >> > > > > > > > > -- > > A.Nazemian > -- A.Nazemian
Re: solr dedup on specific fields
Yeah, unfortunately I want it to be searchable:( On Mon, Jul 7, 2014 at 2:23 PM, Alexandre Rafalovitch wrote: > It's an interesting thought. I haven't tried those. > > But I don't think the EFFs are searchable. Do you need them to be > searchable? > > Regards, >Alex. > Personal website: http://www.outerthoughts.com/ > Current project: http://www.solr-start.com/ - Accelerating your Solr > proficiency > > > On Mon, Jul 7, 2014 at 4:48 PM, Ali Nazemian > wrote: > > Dear Alexande, > > What if I use ExternalFileFiled for the fields that I dont want to be > > changed? Does that work for me? > > Regards. > > > > > > On Mon, Jul 7, 2014 at 2:05 PM, Alexandre Rafalovitch < > arafa...@gmail.com> > > wrote: > > > >> Well, let us know when you figure out a way to satisfy all your > >> requirements. > >> > >> Solr is designed for a full-document replace to be efficient at it's > >> primary function (search). Any workaround require some sort of > >> sacrifice. > >> > >> Good luck, > >>Alex. > >> Personal website: http://www.outerthoughts.com/ > >> Current project: http://www.solr-start.com/ - Accelerating your Solr > >> proficiency > >> > >> > >> On Mon, Jul 7, 2014 at 4:32 PM, Ali Nazemian > >> wrote: > >> > Updating documents will add some extra time to indexing process. (I > send > >> > the documents via apache Nutch) I prefer to make indexing as fast as > >> > possible. > >> > > >> > > >> > On Mon, Jul 7, 2014 at 12:05 PM, Alexandre Rafalovitch < > >> arafa...@gmail.com> > >> > wrote: > >> > > >> >> Can you use Update operation instead of Create? Then, you can supply > >> >> only the fields that need to be changed and use atomic update to > >> >> preserve the others. But then you will have issues when you _are_ > >> >> creating new documents and you do need to store all fields. > >> >> > >> >> Regards, > >> >>Alex. > >> >> Personal website: http://www.outerthoughts.com/ > >> >> Current project: http://www.solr-start.com/ - Accelerating your Solr > >> >> proficiency > >> >> > >> >> > >> >> On Mon, Jul 7, 2014 at 2:08 PM, Ali Nazemian > >> >> wrote: > >> >> > Dears, > >> >> > Is there any way that I can do that in other way? > >> >> > I mean if you look at my main problem again you will find out that > I > >> have > >> >> > two types of fields in my documents. 1) The ones that should be > >> >> overwritten > >> >> > on duplicates, 2) The ones that should not change during > duplicates. > >> So > >> >> Is > >> >> > it another way to handle this situation from the first place? I > mean > >> >> using > >> >> > cross join for example? > >> >> > Assume I have a document with ID 2 which contains all the fields > that > >> can > >> >> > be overwritten. And another document with ID 2 which contains all > >> fields > >> >> > that should not change during duplication detection. For selecting > all > >> >> > fields it is enough to do join on ID and for Duplication it is > enough > >> to > >> >> > overwrite just document type 1. > >> >> > Regards. > >> >> > > >> >> > > >> >> > On Tue, Jul 1, 2014 at 6:17 PM, Alexandre Rafalovitch < > >> >> arafa...@gmail.com> > >> >> > wrote: > >> >> > > >> >> >> Well, it's implemented in SignatureUpdateProcessorFactory. Worst > >> case, > >> >> >> you can clone that code and add your preserve-field functionality. > >> >> >> Could even be a nice contribution. > >> >> >> > >> >> >> Regards, > >> >> >>Alex. 
> >> >> >> > >> >> >> Personal website: http://www.outerthoughts.com/ > >> >> >> Current project: http://www.solr-start.com/ - Accelerating your > Solr > >> >> >> proficiency > >> >> >> > >> >> >> > >> >> >> On Tue, Jul 1, 2014 at 6:50 PM, Ali Nazemian < > alinazem...@gmail.com> > >> >> >> wrote: > >> >> >> > Any suggestion would be appreciated. > >> >> >> > Regards. > >> >> >> > > >> >> >> > > >> >> >> > On Mon, Jun 30, 2014 at 2:49 PM, Ali Nazemian < > >> alinazem...@gmail.com> > >> >> >> wrote: > >> >> >> > > >> >> >> >> Hi, > >> >> >> >> I used solr 4.8 for indexing the web pages that come from > nutch. I > >> >> know > >> >> >> >> that solr deduplication operation works on uniquekey field. So > I > >> set > >> >> >> that > >> >> >> >> to URL field. Everything is OK. except that I want after > >> duplication > >> >> >> >> detection solr try not to delete all fields of old document. I > >> want > >> >> some > >> >> >> >> fields remain unchanged. For example assume I have a data field > >> >> called > >> >> >> >> "read" with Boolean value "true" for specific document. I want > all > >> >> >> fields > >> >> >> >> of new document overwrites except the value of this field. Is > that > >> >> >> >> possible? How? > >> >> >> >> Regards. > >> >> >> >> > >> >> >> >> -- > >> >> >> >> A.Nazemian > >> >> >> >> > >> >> >> > > >> >> >> > > >> >> >> > > >> >> >> > -- > >> >> >> > A.Nazemian > >> >> >> > >> >> > > >> >> > > >> >> > > >> >> > -- > >> >> > A.Nazemian > >> >> > >> > > >> > > >> > > >> > -- > >> > A.Nazemian > >> > > > > > > > > -- > > A.Nazemian > -- A.Nazemian
Re: Need of hadoop
I think this will not improve indexing performance by itself, but it could be a way to get HDFS HA and replication for the index. I am not sure about that, though. On Mon, Jul 7, 2014 at 12:53 PM, search engn dev wrote: > Currently i am exploring hadoop with solr, Somewhere it is written as "This > does not use Hadoop Map-Reduce to process Solr data, rather it only uses > the > HDFS filesystem for index and transaction log file storage. " , > > then what is the advantage of using using hadoop over local file system? > will use of hdfs increase overall performance of searching? > > any detailed pointers regarding this will surely help me to understand > this. > > > > -- > View this message in context: > http://lucene.472066.n3.nabble.com/Need-of-hadoop-tp4145846.html > Sent from the Solr - User mailing list archive at Nabble.com. > -- A.Nazemian
Changing default behavior of solr for overwrite the whole document on uniquekey duplication
Dears, Hi, For my requirements I need to change Solr's default behavior of overwriting the whole document when the unique key is duplicated. I want the overwrite to affect only part of the document (some fields) while the other fields remain unchanged. First of all, is such a change in Solr's behavior possible? Second, I would really appreciate it if you could point me to the class or classes I should look at to change that. Best regards. -- A.Nazemian
Re: Changing default behavior of solr for overwrite the whole document on uniquekey duplication
Dear Himanshu, Hi, You misunderstood what I meant. I am not trying to update a particular field; I want to change what Solr does when the uniqueKey is duplicated. I do not want Solr to overwrite the whole document, only some parts of it. This is not something triggered from the user side; it is what Solr itself does with documents that share a duplicated uniqueKey. Regards. On Tue, Jul 8, 2014 at 12:29 PM, Himanshu Mehrotra < himanshu.mehro...@snapdeal.com> wrote: > Please look at https://wiki.apache.org/solr/Atomic_Updates > > This does what you want just update relevant fields. > > Thanks, > Himanshu > > > On Tue, Jul 8, 2014 at 1:09 PM, Ali Nazemian > wrote: > > > Dears, > > Hi, > > According to my requirement I need to change the default behavior of Solr > > for overwriting the whole document on unique-key duplication. I am going > to > > change that the overwrite just part of document (some fields) and other > > parts of document (other fields) remain unchanged. First of all I need to > > know such changing in Solr behavior is possible? Second, I really > > appreciate if you can guide me through what class/classes should I > consider > > for changing that? > > Best regards. > > > > -- > > A.Nazemian > > > -- A.Nazemian
Re: Changing default behavior of solr for overwrite the whole document on uniquekey duplication
Thank you very much. Now I understand what was the idea. It is better than changing Solr. But does performance remain same in this situation? On Tue, Jul 8, 2014 at 10:43 PM, Chris Hostetter wrote: > > I think you are missunderstanding what Himanshu is suggesting to you. > > You don't need to make lots of big changes ot the internals of solr's code > to get what you want -- instead you can leverage the Atomic Updates & > Optimistic Concurrency features of Solr to get the existing internal Solr > to reject any attempts to add a duplicate documentunless the client code > sending the document specifies it should be an "update". > > This means your client code needs to be a bit more sophisticated, but the > benefit is that you don't have to try to make complex changes to the > internals of Solr that may be impossible and/or difficult to > support/upgrade later. > > More details... > > > https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents#UpdatingPartsofDocuments-OptimisticConcurrency > > Simplest possible idea based on the basic info you have given so far... > > 1) send every doc using _version_=-1 > 2a) if doc update fails with error 409, that means a version of this doc > already exists > 2b) resend just the field changes (using "set" atomic > operation) and specify _version_=1 > > > > : Dear Himanshu, > : Hi, > : You misunderstood what I meant. I am not going to update some field. I am > : going to change what Solr do on duplication of uniquekey field. I dont > want > : to solr overwrite Whole document I just want to overwrite some parts of > : document. This situation does not come from user side this is what solr > do > : to documents with duplicated uniquekey. > : Regards. > : > : > : On Tue, Jul 8, 2014 at 12:29 PM, Himanshu Mehrotra < > : himanshu.mehro...@snapdeal.com> wrote: > : > : > Please look at https://wiki.apache.org/solr/Atomic_Updates > : > > : > This does what you want just update relevant fields. > : > > : > Thanks, > : > Himanshu > : > > : > > : > On Tue, Jul 8, 2014 at 1:09 PM, Ali Nazemian > : > wrote: > : > > : > > Dears, > : > > Hi, > : > > According to my requirement I need to change the default behavior of > Solr > : > > for overwriting the whole document on unique-key duplication. I am > going > : > to > : > > change that the overwrite just part of document (some fields) and > other > : > > parts of document (other fields) remain unchanged. First of all I > need to > : > > know such changing in Solr behavior is possible? Second, I really > : > > appreciate if you can guide me through what class/classes should I > : > consider > : > > for changing that? > : > > Best regards. > : > > > : > > -- > : > > A.Nazemian > : > > > : > > : > : > : > : -- > : A.Nazemian > : > > -Hoss > http://www.lucidworks.com/ > -- A.Nazemian
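A rough SolrJ sketch of the two-step recipe described above (the client class, core name, and field names are assumptions, not from the thread): first attempt the add with _version_=-1, which succeeds only if the document does not exist yet; if Solr answers with a 409 version conflict, resend just the overwritable fields as an atomic "set" update with _version_=1, which succeeds only if the document does exist.

```java
import java.util.Collections;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.SolrInputDocument;

public class UpsertPreservingFields {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/webpages").build();
        String url = "http://example.com/page-1";   // uniqueKey value (illustrative)

        // Step 1: insert-only attempt. _version_ = -1 means "fail if the doc already exists".
        SolrInputDocument fresh = new SolrInputDocument();
        fresh.addField("id", url);
        fresh.addField("content", "full page text");
        fresh.addField("read", false);               // field that must never be overwritten later
        fresh.addField("_version_", -1L);
        try {
            solr.add(fresh);
        } catch (SolrException e) {
            if (e.code() != 409) throw e;            // 409 = version conflict, i.e. doc exists
            // Step 2: doc exists, so atomically update only the overwritable fields.
            SolrInputDocument partial = new SolrInputDocument();
            partial.addField("id", url);
            partial.addField("content", Collections.singletonMap("set", "full page text"));
            partial.addField("_version_", 1L);       // fail if the doc vanished in the meantime
            solr.add(partial);
        }
        solr.commit();
        solr.close();
    }
}
```

The performance cost is roughly one extra round trip for documents that already exist; new documents are indexed with a single request, exactly as before.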
integrating Accumulo with solr
Dear All, Hi, I was wondering whether anybody out there has tried to integrate Solr with Accumulo. I am thinking about running Accumulo on top of HDFS and using Solr to index the data inside Accumulo. Do you have any idea how I could do such an integration? Best regards. -- A.Nazemian
Re: integrating Accumulo with solr
Dear Joe, Hi, I am going to store the crawled web pages in Accumulo as the main storage layer of my project, and I need to hand that data to Solr for indexing and user searches. I also need to run some social and web analysis on the data, and I need some security features. That is why Accumulo is my choice for the database part, while I plan to use Solr for indexing and search. Would you please guide me through that? On Thu, Jul 24, 2014 at 1:28 AM, Joe Gresock wrote: > We store data in both Solr and Accumulo -- do you have more details about > what kind of data and indexing you want? Is there a reason you're thinking > of using both databases in particular? > > > On Wed, Jul 23, 2014 at 5:17 AM, Ali Nazemian > wrote: > > > Dear All, > > Hi, > > I was wondering is there anybody out there that tried to integrate Solr > > with Accumulo? I was thinking about using Accumulo on top of HDFS and > using > > Solr to index data inside Accumulo? Do you have any idea how can I do > such > > integration? > > > > Best regards. > > > > -- > > A.Nazemian > > > > > > -- > I know what it is to be in need, and I know what it is to have plenty. I > have learned the secret of being content in any and every situation, > whether well fed or hungry, whether living in plenty or in want. I can do > all this through him who gives me strength.*-Philippians 4:12-13* > -- A.Nazemian
Re: integrating Accumulo with solr
Thank you very much. Nice Idea but how can Solr and Accumulo can be synchronized in this way? I know that Solr can be integrated with HDFS and also Accumulo works on the top of HDFS. So can I use HDFS as integration point? I mean set Solr to use HDFS as a source of documents as well as the destination of documents. Regards. On Thu, Jul 24, 2014 at 4:33 PM, Joe Gresock wrote: > Ali, > > Sounds like a good choice. It's pretty standard to store the primary > storage id as a field in Solr so that you can search the full text in Solr > and then retrieve the full document elsewhere. > > I would recommend creating a document structure in Solr with whatever > fields you want indexed (most likely as text_en, etc.), and then store a > "string" field named "content_id", which would be the Accumulo row id that > you look up with a scan. > > One caveat -- Accumulo will be protected at the cell level, but if you need > your Solr search results to be protected by complex authorization strings > similar to Accumulo, you will need to write your own QParserPlugin and use > post filtering: > http://java.dzone.com/articles/custom-security-filtering-solr > > The code you see in that article is written for an earlier version of Solr, > but it's not too difficult to adjust it for the latest (we've done so in > our project). Once you've implemented this, you would store an > "authorizations" string field in each Solr document, and pass in the > authorizations that the user has access to in the fq parameter of every > query. It's also not too bad to write something that parses the Accumulo > authorizations string (like A&B&(C|D|E|F)) and interpret it accordingly in > the QParserPlugin. > > This will give you true row level security in Solr and Accumulo, and it > performs quite well in Solr. > > Let me know if you have any other questions. > > Joe > > > On Thu, Jul 24, 2014 at 4:07 AM, Ali Nazemian > wrote: > > > Dear Joe, > > Hi, > > I am going to store the crawl web pages in accumulo as the main storage > > part of my project and I need to give these data to solr for indexing and > > user searches. I need to do some social and web analysis on my data as > well > > as having some security features. Therefore accumulo is my choice for the > > database part and for index and search I am going to use Solr. Would you > > please guide me through that? > > > > > > > > On Thu, Jul 24, 2014 at 1:28 AM, Joe Gresock wrote: > > > > > We store data in both Solr and Accumulo -- do you have more details > about > > > what kind of data and indexing you want? Is there a reason you're > > thinking > > > of using both databases in particular? > > > > > > > > > On Wed, Jul 23, 2014 at 5:17 AM, Ali Nazemian > > > wrote: > > > > > > > Dear All, > > > > Hi, > > > > I was wondering is there anybody out there that tried to integrate > Solr > > > > with Accumulo? I was thinking about using Accumulo on top of HDFS and > > > using > > > > Solr to index data inside Accumulo? Do you have any idea how can I do > > > such > > > > integration? > > > > > > > > Best regards. > > > > > > > > -- > > > > A.Nazemian > > > > > > > > > > > > > > > > -- > > > I know what it is to be in need, and I know what it is to have plenty. > I > > > have learned the secret of being content in any and every situation, > > > whether well fed or hungry, whether living in plenty or in want. 
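A minimal sketch of the Solr side of the pattern described above, with assumed field names (content_id, authorizations) and an assumed core; the Accumulo scan itself is left as a placeholder. Real enforcement of the authorizations string would still go through the custom QParserPlugin post filter mentioned in the thread, not the plain stored field shown here.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class AccumuloBackedSearch {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/webpages").build();

        // Index: searchable fields plus the Accumulo row id and the visibility expression.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "page-0001");
        doc.addField("content_en", "full text extracted from the crawled page");
        doc.addField("content_id", "row_0001");          // Accumulo row id, stored string field
        doc.addField("authorizations", "A&B&(C|D|E|F)"); // Accumulo-style visibility string
        solr.add(doc);
        solr.commit();

        // Search: full-text query in Solr, then fetch each hit from Accumulo by content_id.
        SolrQuery query = new SolrQuery("content_en:crawled");
        for (SolrDocument hit : solr.query(query).getResults()) {
            String rowId = (String) hit.getFieldValue("content_id");
            // An Accumulo scanner lookup of rowId would go here.
            System.out.println("fetch Accumulo row: " + rowId);
        }
        solr.close();
    }
}
```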
I can > > do > > > all this through him who gives me strength.*-Philippians 4:12-13* > > > > > > > > > > > -- > > A.Nazemian > > > > > > -- > I know what it is to be in need, and I know what it is to have plenty. I > have learned the secret of being content in any and every situation, > whether well fed or hungry, whether living in plenty or in want. I can do > all this through him who gives me strength.*-Philippians 4:12-13* > -- A.Nazemian
Re: integrating Accumulo with solr
Dear Jack, Thank you. I am aware of datastax but I am looking for integrating accumulo with solr. This is something like what sqrrl guys offer. Regards. On Thu, Jul 24, 2014 at 7:27 PM, Jack Krupansky wrote: > If you are not a "true hard-core gunslinger" who is willing to dive in and > integrate the code yourself, instead you should give serious consideration > to a product such as DataStax Enterprise that fully integrates and packages > a NoSQL database (Cassandra) and Solr for search. The security aspects are > still a work in progress, but certainly headed in the right direction. And > it has Hadoop and Spark integration as well. > > See: > http://www.datastax.com/what-we-offer/products-services/ > datastax-enterprise > > -- Jack Krupansky > > -Original Message- From: Ali Nazemian > Sent: Thursday, July 24, 2014 10:30 AM > To: solr-user@lucene.apache.org > Subject: Re: integrating Accumulo with solr > > > Thank you very much. Nice Idea but how can Solr and Accumulo can be > synchronized in this way? > I know that Solr can be integrated with HDFS and also Accumulo works on the > top of HDFS. So can I use HDFS as integration point? I mean set Solr to use > HDFS as a source of documents as well as the destination of documents. > Regards. > > > On Thu, Jul 24, 2014 at 4:33 PM, Joe Gresock wrote: > > Ali, >> >> Sounds like a good choice. It's pretty standard to store the primary >> storage id as a field in Solr so that you can search the full text in Solr >> and then retrieve the full document elsewhere. >> >> I would recommend creating a document structure in Solr with whatever >> fields you want indexed (most likely as text_en, etc.), and then store a >> "string" field named "content_id", which would be the Accumulo row id that >> you look up with a scan. >> >> One caveat -- Accumulo will be protected at the cell level, but if you >> need >> your Solr search results to be protected by complex authorization strings >> similar to Accumulo, you will need to write your own QParserPlugin and use >> post filtering: >> http://java.dzone.com/articles/custom-security-filtering-solr >> >> The code you see in that article is written for an earlier version of >> Solr, >> but it's not too difficult to adjust it for the latest (we've done so in >> our project). Once you've implemented this, you would store an >> "authorizations" string field in each Solr document, and pass in the >> authorizations that the user has access to in the fq parameter of every >> query. It's also not too bad to write something that parses the Accumulo >> authorizations string (like A&B&(C|D|E|F)) and interpret it accordingly in >> the QParserPlugin. >> >> This will give you true row level security in Solr and Accumulo, and it >> performs quite well in Solr. >> >> Let me know if you have any other questions. >> >> Joe >> >> >> On Thu, Jul 24, 2014 at 4:07 AM, Ali Nazemian >> wrote: >> >> > Dear Joe, >> > Hi, >> > I am going to store the crawl web pages in accumulo as the main storage >> > part of my project and I need to give these data to solr for indexing > >> and >> > user searches. I need to do some social and web analysis on my data as >> well >> > as having some security features. Therefore accumulo is my choice for > >> the >> > database part and for index and search I am going to use Solr. Would you >> > please guide me through that? 
>> > >> > >> > >> > On Thu, Jul 24, 2014 at 1:28 AM, Joe Gresock >> wrote: >> > >> > > We store data in both Solr and Accumulo -- do you have more details >> about >> > > what kind of data and indexing you want? Is there a reason you're >> > thinking >> > > of using both databases in particular? >> > > >> > > >> > > On Wed, Jul 23, 2014 at 5:17 AM, Ali Nazemian >> > > wrote: >> > > >> > > > Dear All, >> > > > Hi, >> > > > I was wondering is there anybody out there that tried to integrate >> Solr >> > > > with Accumulo? I was thinking about using Accumulo on top of HDFS > >> > > and >> > > using >> > > > Solr to index data inside Accumulo? Do you have any idea how can I >> > > > do >> > > such >> > > > integration? >> > > > >> > > > Best regards. >> > > > >> > > > -- >> > > > A.Nazemian >> > > > >> > > >> > > >> > > >> > > -- >> > > I know what it is to be in need, and I know what it is to have plenty. >> I >> > > have learned the secret of being content in any and every situation, >> > > whether well fed or hungry, whether living in plenty or in want. I > >> > can >> > do >> > > all this through him who gives me strength.*-Philippians 4:12-13* >> > > >> > >> > >> > >> > -- >> > A.Nazemian >> > >> >> >> >> -- >> I know what it is to be in need, and I know what it is to have plenty. I >> have learned the secret of being content in any and every situation, >> whether well fed or hungry, whether living in plenty or in want. I can do >> all this through him who gives me strength.*-Philippians 4:12-13* >> >> > > > -- > A.Nazemian > -- A.Nazemian
Re: integrating Accumulo with solr
Dear Jack, Actually I am going to do benefit-cost analysis for in-house developement or going for sqrrl support. Best regards. On Thu, Jul 24, 2014 at 11:48 PM, Jack Krupansky wrote: > Like I said, you're going to have to be a real, hard-core gunslinger to do > that well. Sqrrl uses Lucene directly, BTW: > > "Full-Text Search: Utilizing open-source Lucene and custom indexing > methods, Sqrrl Enterprise users can conduct real-time, full-text search > across data in Sqrrl Enterprise." > > See: > http://sqrrl.com/product/search/ > > Out of curiosity, why are you not using that integrated Lucene support of > Sqrrl Enterprise? > > > -- Jack Krupansky > > -Original Message- From: Ali Nazemian > Sent: Thursday, July 24, 2014 3:07 PM > > To: solr-user@lucene.apache.org > Subject: Re: integrating Accumulo with solr > > Dear Jack, > Thank you. I am aware of datastax but I am looking for integrating accumulo > with solr. This is something like what sqrrl guys offer. > Regards. > > > On Thu, Jul 24, 2014 at 7:27 PM, Jack Krupansky > wrote: > > If you are not a "true hard-core gunslinger" who is willing to dive in and >> integrate the code yourself, instead you should give serious consideration >> to a product such as DataStax Enterprise that fully integrates and >> packages >> a NoSQL database (Cassandra) and Solr for search. The security aspects are >> still a work in progress, but certainly headed in the right direction. And >> it has Hadoop and Spark integration as well. >> >> See: >> http://www.datastax.com/what-we-offer/products-services/ >> datastax-enterprise >> >> -- Jack Krupansky >> >> -Original Message- From: Ali Nazemian >> Sent: Thursday, July 24, 2014 10:30 AM >> To: solr-user@lucene.apache.org >> Subject: Re: integrating Accumulo with solr >> >> >> Thank you very much. Nice Idea but how can Solr and Accumulo can be >> synchronized in this way? >> I know that Solr can be integrated with HDFS and also Accumulo works on >> the >> top of HDFS. So can I use HDFS as integration point? I mean set Solr to >> use >> HDFS as a source of documents as well as the destination of documents. >> Regards. >> >> >> On Thu, Jul 24, 2014 at 4:33 PM, Joe Gresock wrote: >> >> Ali, >> >>> >>> Sounds like a good choice. It's pretty standard to store the primary >>> storage id as a field in Solr so that you can search the full text in >>> Solr >>> and then retrieve the full document elsewhere. >>> >>> I would recommend creating a document structure in Solr with whatever >>> fields you want indexed (most likely as text_en, etc.), and then store a >>> "string" field named "content_id", which would be the Accumulo row id >>> that >>> you look up with a scan. >>> >>> One caveat -- Accumulo will be protected at the cell level, but if you >>> need >>> your Solr search results to be protected by complex authorization strings >>> similar to Accumulo, you will need to write your own QParserPlugin and >>> use >>> post filtering: >>> http://java.dzone.com/articles/custom-security-filtering-solr >>> >>> The code you see in that article is written for an earlier version of >>> Solr, >>> but it's not too difficult to adjust it for the latest (we've done so in >>> our project). Once you've implemented this, you would store an >>> "authorizations" string field in each Solr document, and pass in the >>> authorizations that the user has access to in the fq parameter of every >>> query. 
It's also not too bad to write something that parses the Accumulo >>> authorizations string (like A&B&(C|D|E|F)) and interpret it accordingly >>> in >>> the QParserPlugin. >>> >>> This will give you true row level security in Solr and Accumulo, and it >>> performs quite well in Solr. >>> >>> Let me know if you have any other questions. >>> >>> Joe >>> >>> >>> On Thu, Jul 24, 2014 at 4:07 AM, Ali Nazemian >>> wrote: >>> >>> > Dear Joe, >>> > Hi, >>> > I am going to store the crawl web pages in accumulo as the main storage >>> > part of my project and I need to give these data to solr for indexing > >>> and >>> > user searches. I need to do some social and web analysis on my data as >>> well >>> > as
Re: integrating Accumulo with solr
Dear Jack, Hi, One more thing to mention: I dont want to use solr or lucence for indexing accumulo or full text search inside that. I am looking for have both in a sync mode. I mean import some parts of data to solr for indexing. For this purpose probably I need something like trigger in RDBMS, I have to define something (probably with accumulo iterator) to import to solr on inserting new data. Regards. On Fri, Jul 25, 2014 at 12:59 PM, Ali Nazemian wrote: > Dear Jack, > Actually I am going to do benefit-cost analysis for in-house developement > or going for sqrrl support. > Best regards. > > > On Thu, Jul 24, 2014 at 11:48 PM, Jack Krupansky > wrote: > >> Like I said, you're going to have to be a real, hard-core gunslinger to >> do that well. Sqrrl uses Lucene directly, BTW: >> >> "Full-Text Search: Utilizing open-source Lucene and custom indexing >> methods, Sqrrl Enterprise users can conduct real-time, full-text search >> across data in Sqrrl Enterprise." >> >> See: >> http://sqrrl.com/product/search/ >> >> Out of curiosity, why are you not using that integrated Lucene support of >> Sqrrl Enterprise? >> >> >> -- Jack Krupansky >> >> -Original Message- From: Ali Nazemian >> Sent: Thursday, July 24, 2014 3:07 PM >> >> To: solr-user@lucene.apache.org >> Subject: Re: integrating Accumulo with solr >> >> Dear Jack, >> Thank you. I am aware of datastax but I am looking for integrating >> accumulo >> with solr. This is something like what sqrrl guys offer. >> Regards. >> >> >> On Thu, Jul 24, 2014 at 7:27 PM, Jack Krupansky >> wrote: >> >> If you are not a "true hard-core gunslinger" who is willing to dive in >>> and >>> integrate the code yourself, instead you should give serious >>> consideration >>> to a product such as DataStax Enterprise that fully integrates and >>> packages >>> a NoSQL database (Cassandra) and Solr for search. The security aspects >>> are >>> still a work in progress, but certainly headed in the right direction. >>> And >>> it has Hadoop and Spark integration as well. >>> >>> See: >>> http://www.datastax.com/what-we-offer/products-services/ >>> datastax-enterprise >>> >>> -- Jack Krupansky >>> >>> -Original Message- From: Ali Nazemian >>> Sent: Thursday, July 24, 2014 10:30 AM >>> To: solr-user@lucene.apache.org >>> Subject: Re: integrating Accumulo with solr >>> >>> >>> Thank you very much. Nice Idea but how can Solr and Accumulo can be >>> synchronized in this way? >>> I know that Solr can be integrated with HDFS and also Accumulo works on >>> the >>> top of HDFS. So can I use HDFS as integration point? I mean set Solr to >>> use >>> HDFS as a source of documents as well as the destination of documents. >>> Regards. >>> >>> >>> On Thu, Jul 24, 2014 at 4:33 PM, Joe Gresock wrote: >>> >>> Ali, >>> >>>> >>>> Sounds like a good choice. It's pretty standard to store the primary >>>> storage id as a field in Solr so that you can search the full text in >>>> Solr >>>> and then retrieve the full document elsewhere. >>>> >>>> I would recommend creating a document structure in Solr with whatever >>>> fields you want indexed (most likely as text_en, etc.), and then store a >>>> "string" field named "content_id", which would be the Accumulo row id >>>> that >>>> you look up with a scan. 
>>>> >>>> One caveat -- Accumulo will be protected at the cell level, but if you >>>> need >>>> your Solr search results to be protected by complex authorization >>>> strings >>>> similar to Accumulo, you will need to write your own QParserPlugin and >>>> use >>>> post filtering: >>>> http://java.dzone.com/articles/custom-security-filtering-solr >>>> >>>> The code you see in that article is written for an earlier version of >>>> Solr, >>>> but it's not too difficult to adjust it for the latest (we've done so in >>>> our project). Once you've implemented this, you would store an >>>> "authorizations" string field in each Solr document, and pass in the >>>> authorizations that the user has access to in the fq parameter of every >>
Re: integrating Accumulo with solr
Sure, Thank you very much for your guide. I think I am not that kind of gunslinger and probably I will go for another NoSQL that can be integrated with solr/elastic search much easier:) Best regards. On Sun, Jul 27, 2014 at 5:02 PM, Jack Krupansky wrote: > Right, and that's exactly what DataStax Enterprise provides (at great > engineering effort!) - synchronization of database updates and search > indexing. Sure, you can do it as well, but that's a significant engineering > challenge with both sides of the equation, and not a simple "plug and play" > configuration setting by writing a simple "connector." > > But, hey, if you consider yourself one of those "true hard-core > gunslingers" then you'll be able to code that up in a weekend without any > of our assistance, right? > > In short, synchronizing two data stores is a real challenge. Yes, it is > doable, but... it is non-trivial. Especially if both stores are distributed > clusters. Maybe now you can guess why the Sqrrl guys went the Lucene route > instead of Solr. > > I'm certainly not suggesting that it can't be done. Just highlighting the > challenge of such a task. > > Just to be clear, you are referring to "sync mode" and not mere "ETL", > which people do all the time with batch scripts, Java extraction and > ingestion connectors, and cron jobs. > > Give it a shot and let us know how it works out. > > > -- Jack Krupansky > > -Original Message- From: Ali Nazemian > Sent: Sunday, July 27, 2014 1:20 AM > > To: solr-user@lucene.apache.org > Subject: Re: integrating Accumulo with solr > > Dear Jack, > Hi, > One more thing to mention: I dont want to use solr or lucence for indexing > accumulo or full text search inside that. I am looking for have both in a > sync mode. I mean import some parts of data to solr for indexing. For this > purpose probably I need something like trigger in RDBMS, I have to define > something (probably with accumulo iterator) to import to solr on inserting > new data. > Regards. > > On Fri, Jul 25, 2014 at 12:59 PM, Ali Nazemian > wrote: > > Dear Jack, >> Actually I am going to do benefit-cost analysis for in-house developement >> or going for sqrrl support. >> Best regards. >> >> >> On Thu, Jul 24, 2014 at 11:48 PM, Jack Krupansky > > >> wrote: >> >> Like I said, you're going to have to be a real, hard-core gunslinger to >>> do that well. Sqrrl uses Lucene directly, BTW: >>> >>> "Full-Text Search: Utilizing open-source Lucene and custom indexing >>> methods, Sqrrl Enterprise users can conduct real-time, full-text search >>> across data in Sqrrl Enterprise." >>> >>> See: >>> http://sqrrl.com/product/search/ >>> >>> Out of curiosity, why are you not using that integrated Lucene support of >>> Sqrrl Enterprise? >>> >>> >>> -- Jack Krupansky >>> >>> -Original Message- From: Ali Nazemian >>> Sent: Thursday, July 24, 2014 3:07 PM >>> >>> To: solr-user@lucene.apache.org >>> Subject: Re: integrating Accumulo with solr >>> >>> Dear Jack, >>> Thank you. I am aware of datastax but I am looking for integrating >>> accumulo >>> with solr. This is something like what sqrrl guys offer. >>> Regards. >>> >>> >>> On Thu, Jul 24, 2014 at 7:27 PM, Jack Krupansky >> > >>> wrote: >>> >>> If you are not a "true hard-core gunslinger" who is willing to dive in >>> >>>> and >>>> integrate the code yourself, instead you should give serious >>>> consideration >>>> to a product such as DataStax Enterprise that fully integrates and >>>> packages >>>> a NoSQL database (Cassandra) and Solr for search. 
The security aspects >>>> are >>>> still a work in progress, but certainly headed in the right direction. >>>> And >>>> it has Hadoop and Spark integration as well. >>>> >>>> See: >>>> http://www.datastax.com/what-we-offer/products-services/ >>>> datastax-enterprise >>>> >>>> -- Jack Krupansky >>>> >>>> -Original Message- From: Ali Nazemian >>>> Sent: Thursday, July 24, 2014 10:30 AM >>>> To: solr-user@lucene.apache.org >>>> Subject: Re: integrating Accumulo with solr >>>> >>>> >>>> Thank you very much. Nice Idea but how can Solr and Accumulo can
solr over hdfs for accessing/ changing indexes outside solr
Dear all, Hi, I configured Solr 4.9 to write its index and data on HDFS. Now I would like to access that data from outside Solr in order to change some of the values. Could somebody please tell me how that is possible? Suppose I am using HBase over HDFS to make these changes. Best regards. -- A.Nazemian
Re: solr over hdfs for accessing/ changing indexes outside solr
Actually, I am going to run some analysis on the Solr data using MapReduce. For that purpose I may need to change parts of the data or add new fields from outside Solr. On Tue, Aug 5, 2014 at 5:51 PM, Shawn Heisey wrote: > On 8/5/2014 7:04 AM, Ali Nazemian wrote: > > I changed solr 4.9 to write index and data on hdfs. Now I am going to > > connect to those data from the outside of solr for changing some of the > > values. Could somebody please tell me how that is possible? Suppose I am > > using Hbase over hdfs for do these changes. > > I don't know how you could safely modify the index without a Lucene > application or another instance of Solr, but if you do manage to modify > the index, simply reloading the core or restarting Solr should cause it > to pick up the changes. Either you would need to make sure that Solr > never modifies the index, or you would need some way of coordinating > updates so that Solr and the other application would never try to modify > the index at the same time. > > Thanks, > Shawn > > -- A.Nazemian
Re: solr over hdfs for accessing/ changing indexes outside solr
Dear Erick, Hi, Thank you for you reply. Yeah I am aware that SolrJ is my last option. I was thinking about raw I/O operation. So according to your reply probably it is not applicable somehow. What about the Lily project that Michael mentioned? Is that consider SolrJ too? Are you aware of Cloudera search? I know they provide an integrated Hadoop ecosystem. Do you know what is their suggestion? Best regards. On Wed, Aug 6, 2014 at 12:28 AM, Erick Erickson wrote: > What you haven't told us is what you mean by "modify the > index outside Solr". SolrJ? Using raw Lucene? Trying to modify > things by writing your own codec? Standard Java I/O operations? > Other? > > You could use SolrJ to connect to an existing Solr server and > both read and modify at will form your M/R jobs. But if you're > thinking of trying to write/modify the segment files by raw I/O > operations, good luck! I'm 99.99% certain that's going to cause > you endless grief. > > Best, > Erick > > > On Tue, Aug 5, 2014 at 9:55 AM, Ali Nazemian > wrote: > > > Actually I am going to do some analysis on the solr data using map > reduce. > > For this purpose it might be needed to change some part of data or add > new > > fields from outside solr. > > > > > > On Tue, Aug 5, 2014 at 5:51 PM, Shawn Heisey wrote: > > > > > On 8/5/2014 7:04 AM, Ali Nazemian wrote: > > > > I changed solr 4.9 to write index and data on hdfs. Now I am going to > > > > connect to those data from the outside of solr for changing some of > the > > > > values. Could somebody please tell me how that is possible? Suppose I > > am > > > > using Hbase over hdfs for do these changes. > > > > > > I don't know how you could safely modify the index without a Lucene > > > application or another instance of Solr, but if you do manage to modify > > > the index, simply reloading the core or restarting Solr should cause it > > > to pick up the changes. Either you would need to make sure that Solr > > > never modifies the index, or you would need some way of coordinating > > > updates so that Solr and the other application would never try to > modify > > > the index at the same time. > > > > > > Thanks, > > > Shawn > > > > > > > > > > > > -- > > A.Nazemian > > > -- A.Nazemian
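As one concrete illustration of the SolrJ route mentioned above (rather than raw I/O on the index files): an external analysis job can read documents from a live Solr node and write computed fields back with atomic updates, so Lucene remains the only thing that ever rewrites segments. Field names and the scoring step below are placeholders; for SolrCloud the only change would be using CloudSolrClient pointed at the ZooKeeper ensemble.

```java
import java.util.Collections;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class EnrichmentJob {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/webpages").build();

        // Pull a page of documents that have not been scored yet (illustrative field name).
        SolrQuery query = new SolrQuery("*:* -analysis_score:[* TO *]");
        query.setRows(100);

        for (SolrDocument hit : solr.query(query).getResults()) {
            String id = (String) hit.getFieldValue("id");
            double score = computeScore(hit);    // placeholder for the real analysis step

            // Write the result back as an atomic update instead of modifying index files.
            SolrInputDocument update = new SolrInputDocument();
            update.addField("id", id);
            update.addField("analysis_score", Collections.singletonMap("set", score));
            solr.add(update);
        }
        solr.commit();
        solr.close();
    }

    private static double computeScore(SolrDocument doc) {
        return 0.5; // stand-in for whatever the MapReduce analysis actually calculates
    }
}
```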
Re: solr over hdfs for accessing/ changing indexes outside solr
Dear Erick, I remembered some times ago, somebody asked about what is the point of modify Solr to use HDFS for storing indexes. As far as I remember somebody told him integrating Solr with HDFS has two advantages. 1) having hadoop replication and HA. 2) using indexes and Solr documents for other purposes such as Analysis. So why we go for HDFS in the case of analysis if we want to use SolrJ for this purpose? What is the point? Regards. On Wed, Aug 6, 2014 at 8:59 AM, Ali Nazemian wrote: > Dear Erick, > Hi, > Thank you for you reply. Yeah I am aware that SolrJ is my last option. I > was thinking about raw I/O operation. So according to your reply probably > it is not applicable somehow. What about the Lily project that Michael > mentioned? Is that consider SolrJ too? Are you aware of Cloudera search? I > know they provide an integrated Hadoop ecosystem. Do you know what is their > suggestion? > Best regards. > > > > On Wed, Aug 6, 2014 at 12:28 AM, Erick Erickson > wrote: > >> What you haven't told us is what you mean by "modify the >> index outside Solr". SolrJ? Using raw Lucene? Trying to modify >> things by writing your own codec? Standard Java I/O operations? >> Other? >> >> You could use SolrJ to connect to an existing Solr server and >> both read and modify at will form your M/R jobs. But if you're >> thinking of trying to write/modify the segment files by raw I/O >> operations, good luck! I'm 99.99% certain that's going to cause >> you endless grief. >> >> Best, >> Erick >> >> >> On Tue, Aug 5, 2014 at 9:55 AM, Ali Nazemian >> wrote: >> >> > Actually I am going to do some analysis on the solr data using map >> reduce. >> > For this purpose it might be needed to change some part of data or add >> new >> > fields from outside solr. >> > >> > >> > On Tue, Aug 5, 2014 at 5:51 PM, Shawn Heisey wrote: >> > >> > > On 8/5/2014 7:04 AM, Ali Nazemian wrote: >> > > > I changed solr 4.9 to write index and data on hdfs. Now I am going >> to >> > > > connect to those data from the outside of solr for changing some of >> the >> > > > values. Could somebody please tell me how that is possible? Suppose >> I >> > am >> > > > using Hbase over hdfs for do these changes. >> > > >> > > I don't know how you could safely modify the index without a Lucene >> > > application or another instance of Solr, but if you do manage to >> modify >> > > the index, simply reloading the core or restarting Solr should cause >> it >> > > to pick up the changes. Either you would need to make sure that Solr >> > > never modifies the index, or you would need some way of coordinating >> > > updates so that Solr and the other application would never try to >> modify >> > > the index at the same time. >> > > >> > > Thanks, >> > > Shawn >> > > >> > > >> > >> > >> > -- >> > A.Nazemian >> > >> > > > > -- > A.Nazemian > -- A.Nazemian
indexing comments with Apache Solr
Dear all, Hi, I was wondering how I can manage to index comments in Solr. Suppose I am going to index a web page that contains a news article plus some comments posted by people at the end of the page. How can I index those comments in Solr, considering that I want to run some analysis on them? For example, I want enough query flexibility to retrieve all comments posted between 24 June 2014 and 24 July 2014, or all comments posted by a specific person. Defining the comments as a multi-valued field is therefore not a solution, because that kind of query flexibility would not be feasible. So what is your suggestion about document granularity in this case? Can I treat each comment as a new document inside the main document (a tree-based structure)? What would you suggest here? I think this is a common case when indexing web pages these days, so I am probably not the only one thinking about it. Please share your thoughts and perhaps your experience with me. Thank you very much. Best regards. -- A.Nazemian
Re: indexing comments with Apache Solr
Dear Gora, I think you misunderstood my problem. I already use Nutch for crawling the websites; my problem is on the indexing side, not the crawling side. Suppose the page has been fetched and parsed by Nutch, and all the comments along with their dates and authors have been identified during parsing. What can I do to index these comments? What should the document granularity be? Best regards. On Wed, Aug 6, 2014 at 1:29 PM, Gora Mohanty wrote: > On 6 August 2014 14:13, Ali Nazemian wrote: > > > > Dear all, > > Hi, > > I was wondering how can I mange to index comments in solr? suppose I am > > going to index a web page that has a content of news and some comments > that > > are presented by people at the end of this page. How can I index these > > comments in solr? consider the fact that I am going to do some analysis > on > > these comments. For example I want to have such query flexibility for > > retrieving all comments that are presented between 24 June 2014 to 24 > July > > 2014! or all the comments that are presented by specific person. > Therefore > > defining these comment as multi-value field would not be the solution > since > > in this case such query flexibility is not feasible. So what is you > > suggestion about document granularity in this case? Can I consider all of > > these comments as a new document inside main document (tree based > > structure). What is your suggestion for this case? I think it is a common > > case of indexing webpages these days so probably I am not the only one > > thinking about this situation. Please share you though and perhaps your > > experiences in this condition with me. Thank you very much. > > Parsing a web page, and breaking up parts up for indexing into different > fields > is out of the scope of Solr. You might want to look at Apache Nutch which > can index into Solr, and/or other web crawlers/scrapers. > > Regards, > Gora > -- A.Nazemian
Re: indexing comments with Apache Solr
Dear Alexandre, Hi, Thank you very much. I think nested document is what I need. Do you have more information about how can I define such thing in solr schema? Your mentioned blog post was all about retrieving nested docs. Best regards. On Wed, Aug 6, 2014 at 5:16 PM, Alexandre Rafalovitch wrote: > You can index comments as child records. The structure of the Solr > document should be able to incorporate both parents and children > fields and you need to index them all together. Then, just search for > JOIN syntax for nested documents. Also, latest Solr (4.9) has some > extra functionality that allows you to find all parent pages and then > expand children pages to match. > > E.g.: http://heliosearch.org/expand-block-join/ seems relevant > > Regards, >Alex. > Personal: http://www.outerthoughts.com/ and @arafalov > Solr resources and newsletter: http://www.solr-start.com/ and @solrstart > Solr popularizers community: https://www.linkedin.com/groups?gid=6713853 > > > On Wed, Aug 6, 2014 at 11:18 AM, Ali Nazemian > wrote: > > Dear Gora, > > I think you misunderstood my problem. Actually I used nutch for crawling > > websites and my problem is in index side and not crawl side. Suppose page > > is fetch and parsed by Nutch and all comments and the date and source of > > comments are identified by parsing. Now what can I do for indexing these > > comments? What is the document granularity? > > Best regards. > > > > > > On Wed, Aug 6, 2014 at 1:29 PM, Gora Mohanty wrote: > > > >> On 6 August 2014 14:13, Ali Nazemian wrote: > >> > > >> > Dear all, > >> > Hi, > >> > I was wondering how can I mange to index comments in solr? suppose I > am > >> > going to index a web page that has a content of news and some comments > >> that > >> > are presented by people at the end of this page. How can I index these > >> > comments in solr? consider the fact that I am going to do some > analysis > >> on > >> > these comments. For example I want to have such query flexibility for > >> > retrieving all comments that are presented between 24 June 2014 to 24 > >> July > >> > 2014! or all the comments that are presented by specific person. > >> Therefore > >> > defining these comment as multi-value field would not be the solution > >> since > >> > in this case such query flexibility is not feasible. So what is you > >> > suggestion about document granularity in this case? Can I consider > all of > >> > these comments as a new document inside main document (tree based > >> > structure). What is your suggestion for this case? I think it is a > common > >> > case of indexing webpages these days so probably I am not the only one > >> > thinking about this situation. Please share you though and perhaps > your > >> > experiences in this condition with me. Thank you very much. > >> > >> Parsing a web page, and breaking up parts up for indexing into different > >> fields > >> is out of the scope of Solr. You might want to look at Apache Nutch > which > >> can index into Solr, and/or other web crawlers/scrapers. > >> > >> Regards, > >> Gora > >> > > > > > > > > -- > > A.Nazemian > -- A.Nazemian
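A short sketch of what the parent/child approach could look like with SolrJ and block join, using assumed field names (doc_type, comment_author, comment_date, comment_text). Schema-wise the main requirements are that these fields exist and that the schema contains the _root_ field (present in the default schemas); parent and children must be sent together in a single add call.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CommentsAsChildDocs {
    public static void main(String[] args) throws Exception {
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/webpages").build();

        // Parent: the news page itself.
        SolrInputDocument page = new SolrInputDocument();
        page.addField("id", "http://example.com/news-1");
        page.addField("doc_type", "page");
        page.addField("content", "body of the news article");

        // Children: one document per comment, each with its own author and date.
        SolrInputDocument comment = new SolrInputDocument();
        comment.addField("id", "http://example.com/news-1#c1");
        comment.addField("doc_type", "comment");
        comment.addField("comment_author", "john");
        comment.addField("comment_date", "2014-07-01T10:00:00Z");
        comment.addField("comment_text", "nice article");
        page.addChildDocument(comment);

        solr.add(page);          // parent and children are indexed as one block
        solr.commit();

        // Pages that received a comment from "john" within the date range.
        SolrQuery pagesWithJohn = new SolrQuery(
            "{!parent which=\"doc_type:page\"}comment_author:john"
            + " AND comment_date:[2014-06-24T00:00:00Z TO 2014-07-24T23:59:59Z]");
        System.out.println(solr.query(pagesWithJohn).getResults().getNumFound());

        // The comments themselves, independent of their parent pages.
        SolrQuery commentsOnly = new SolrQuery(
            "doc_type:comment AND comment_date:[2014-06-24T00:00:00Z TO 2014-07-24T23:59:59Z]");
        System.out.println(solr.query(commentsOnly).getResults().getNumFound());

        solr.close();
    }
}
```

Because each comment is its own document, per-author and per-date-range queries work with ordinary field queries, and block join relates them back to the containing page when needed.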
Re: solr over hdfs for accessing/ changing indexes outside solr
Thank you very much. But why we should go for solr distributed with hadoop? There is already solrCloud which is pretty applicable in the case of big index. Is there any advantage for sending indexes over map reduce that solrCloud can not provide? Regards. On Wed, Aug 6, 2014 at 9:09 PM, Erick Erickson wrote: > bq: Are you aware of Cloudera search? I know they provide an integrated > Hadoop ecosystem. > > What Cloudera Search does via the MapReduceIndexerTool (MRIT) is create N > sub-indexes for > each shard in the M/R paradigm via EmbeddedSolrServer. Eventually, these > sub-indexes for > each shard are merged (perhaps through some number of levels) in the reduce > phase and > maybe merged into a live Solr instance (--go-live). You'll note that this > tool requires the > address of the ZK ensemble from which it can get the network topology, > configuration files, > all that rot. If you don't use the --go-live option, the output is still a > Solr index, it's just that > the index for each shard is left in a specific directory on HDFS. Being on > HDFS allows > this kind of M/R paradigm for massively parallel indexing operations, and > perhaps massively > complex analysis. > > Nowhere is there any low-level non-Solr manipulation of the indexes. > > The Flume fork just writes directly to the Solr nodes. It knows about the > ZooKeeper > ensemble and the collection too and communicates via SolrJ I'm pretty sure. > > As far as integrating with HDFS, you're right, HA is part of the package. > As far as using > the Solr indexes for analysis, well you can write anything you want to use > the Solr indexes > from anywhere in the M/R world and have them available from anywhere in the > cluster. There's > no real need to even have Solr running, you could use the output from MRIT > and access the > sub-shards with the EmbeddedSolrServer if you wanted, leaving out all the > pesky servlet > container stuff. > > bq: So why we go for HDFS in the case of analysis if we want to use SolrJ > for this purpose? > What is the point? > > Scale and data access in a nutshell. In the HDFS world, you can scale > pretty linearly > with the number of nodes you can rack together. > > Frankly though, if your data set is small enough to fit on a single machine > _and_ you can get > through your analysis in a reasonable time (reasonable here is up to you), > then HDFS > is probably not worth the hassle. But in the big data world where we're > talking petabyte scale, > having HDFS as the underpinning opens up possibilities for working on data > that were > difficult/impossible with Solr previously. > > Best, > Erick > > > > On Tue, Aug 5, 2014 at 9:37 PM, Ali Nazemian > wrote: > > > Dear Erick, > > I remembered some times ago, somebody asked about what is the point of > > modify Solr to use HDFS for storing indexes. As far as I remember > somebody > > told him integrating Solr with HDFS has two advantages. 1) having hadoop > > replication and HA. 2) using indexes and Solr documents for other > purposes > > such as Analysis. So why we go for HDFS in the case of analysis if we > want > > to use SolrJ for this purpose? What is the point? > > Regards. > > > > > > On Wed, Aug 6, 2014 at 8:59 AM, Ali Nazemian > > wrote: > > > > > Dear Erick, > > > Hi, > > > Thank you for you reply. Yeah I am aware that SolrJ is my last option. > I > > > was thinking about raw I/O operation. So according to your reply > probably > > > it is not applicable somehow. What about the Lily project that Michael > > > mentioned? Is that consider SolrJ too? 
Are you aware of Cloudera > search? > > I > > > know they provide an integrated Hadoop ecosystem. Do you know what is > > their > > > suggestion? > > > Best regards. > > > > > > > > > > > > On Wed, Aug 6, 2014 at 12:28 AM, Erick Erickson < > erickerick...@gmail.com > > > > > > wrote: > > > > > >> What you haven't told us is what you mean by "modify the > > >> index outside Solr". SolrJ? Using raw Lucene? Trying to modify > > >> things by writing your own codec? Standard Java I/O operations? > > >> Other? > > >> > > >> You could use SolrJ to connect to an existing Solr server and > > >> both read and modify at will from your M/R jobs. But if you're > > >> thinking of trying to write/modify the segment files by raw I/O > > >> operations, good luck! I'm 99.99% certain that's going to cause
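A rough sketch of the "access the sub-shards with the EmbeddedSolrServer" idea mentioned above, assuming SolrJ 4.x, a made-up local Solr home path, and a made-up core name (the MRIT-produced shard index would first have to be copied or mounted onto a local path):

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.core.CoreContainer;

public class ReadSubShardOffline {
    public static void main(String[] args) throws Exception {
        // Made-up Solr home that contains one of the MRIT-produced shard indexes.
        CoreContainer container = new CoreContainer("/data/mrit-output/solr-home");
        container.load();

        // The core name is an assumption; it must match a core defined under that Solr home.
        EmbeddedSolrServer server = new EmbeddedSolrServer(container, "collection1_shard1");
        try {
            QueryResponse rsp = server.query(new SolrQuery("*:*"));
            System.out.println("Docs in this sub-shard: " + rsp.getResults().getNumFound());
        } finally {
            server.shutdown(); // shuts down the embedded server and its cores
        }
    }
}
```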
Re: solr over hdfs for accessing/ changing indexes outside solr
Dear Erick, Could you please name those problems that SolrCloud can not tackle them alone? Maybe I need solrCloud+ Hadoop and I am not aware of that yet. Regards. On Thu, Aug 7, 2014 at 7:37 PM, Erick Erickson wrote: > If SolrCloud meets your needs, without Hadoop, then > there's no real reason to introduce the added complexity. > > There are a bunch of problems that do _not_ work > well with SolrCloud over non-Hadoop file systems. For > those problems, the combination of SolrCloud and Hadoop > make tackling them possible. > > Best, > Erick > > > On Thu, Aug 7, 2014 at 3:55 AM, Ali Nazemian > wrote: > > > Thank you very much. But why we should go for solr distributed with > hadoop? > > There is already solrCloud which is pretty applicable in the case of big > > index. Is there any advantage for sending indexes over map reduce that > > solrCloud can not provide? > > Regards. > > > > > > On Wed, Aug 6, 2014 at 9:09 PM, Erick Erickson > > wrote: > > > > > bq: Are you aware of Cloudera search? I know they provide an integrated > > > Hadoop ecosystem. > > > > > > What Cloudera Search does via the MapReduceIndexerTool (MRIT) is > create N > > > sub-indexes for > > > each shard in the M/R paradigm via EmbeddedSolrServer. Eventually, > these > > > sub-indexes for > > > each shard are merged (perhaps through some number of levels) in the > > reduce > > > phase and > > > maybe merged into a live Solr instance (--go-live). You'll note that > this > > > tool requires the > > > address of the ZK ensemble from which it can get the network topology, > > > configuration files, > > > all that rot. If you don't use the --go-live option, the output is > still > > a > > > Solr index, it's just that > > > the index for each shard is left in a specific directory on HDFS. Being > > on > > > HDFS allows > > > this kind of M/R paradigm for massively parallel indexing operations, > and > > > perhaps massively > > > complex analysis. > > > > > > Nowhere is there any low-level non-Solr manipulation of the indexes. > > > > > > The Flume fork just writes directly to the Solr nodes. It knows about > the > > > ZooKeeper > > > ensemble and the collection too and communicates via SolrJ I'm pretty > > sure. > > > > > > As far as integrating with HDFS, you're right, HA is part of the > package. > > > As far as using > > > the Solr indexes for analysis, well you can write anything you want to > > use > > > the Solr indexes > > > from anywhere in the M/R world and have them available from anywhere in > > the > > > cluster. There's > > > no real need to even have Solr running, you could use the output from > > MRIT > > > and access the > > > sub-shards with the EmbeddedSolrServer if you wanted, leaving out all > the > > > pesky servlet > > > container stuff. > > > > > > bq: So why we go for HDFS in the case of analysis if we want to use > SolrJ > > > for this purpose? > > > What is the point? > > > > > > Scale and data access in a nutshell. In the HDFS world, you can scale > > > pretty linearly > > > with the number of nodes you can rack together. > > > > > > Frankly though, if your data set is small enough to fit on a single > > machine > > > _and_ you can get > > > through your analysis in a reasonable time (reasonable here is up to > > you), > > > then HDFS > > > is probably not worth the hassle. But in the big data world where we're > > > talking petabyte scale, > > > having HDFS as the underpinning opens up possibilities for working on > > data > > > that were > > > difficult/impossible with Solr previously. 
> > > > > > Best, > > > Erick > > > > > > > > > > > > On Tue, Aug 5, 2014 at 9:37 PM, Ali Nazemian > > > wrote: > > > > > > > Dear Erick, > > > > I remembered some times ago, somebody asked about what is the point > of > > > > modify Solr to use HDFS for storing indexes. As far as I remember > > > somebody > > > > told him integrating Solr with HDFS has two advantages. 1) having > > hadoop > > > > replication and HA. 2) using indexes and Solr documents for other > > > purposes > > > > such as Analysis. So
Send nested doc with solrJ
Dear all, Hi, I was wondering how I can use SolrJ for sending nested documents to Solr? Unfortunately I did not find any tutorial for this purpose. I would really appreciate it if you could guide me through that. Thank you very much. Best regards. -- A.Nazemian
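A minimal SolrJ sketch of one way to send a parent document with a child document, assuming a SolrJ 4.x client and made-up URL, core, and field names:

```java
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class NestedDocExample {
    public static void main(String[] args) throws Exception {
        // Made-up URL and core name.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrInputDocument parent = new SolrInputDocument();
        parent.addField("id", "product01");
        parent.addField("type", "product");

        SolrInputDocument child = new SolrInputDocument();
        child.addField("id", "part01");
        child.addField("type", "part");

        // Attach the child so the whole parent/child block is indexed together.
        parent.addChildDocument(child);

        solr.add(parent);
        solr.commit();
        solr.shutdown();
    }
}
```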
boosting words from specific list
Dear all, Hi, I was wondering how I can implement boosting of words from a specific list of important words in Solr? I mean I want to have a list of important words and tell Solr to score documents based on the weighted sum of these words. For example, let the word "school" have a weight of 2 and the word "president" a weight of 5. In this case a doc with 2 "school" words and 3 "president" words will have a total score of 2*2 + 3*5 = 19. I want to sort documents based on this score. How is such a procedure possible in Solr? Thank you very much. Best regards. -- A.Nazemian
solrJ bug related to solrJ 4.10 for having both incremental partial update and child document on the same solr document!
Dear all, Hi, Right now I am facing a strange problem with the SolrJ client: when I use only an incremental partial update, the partial update works fine. When I use only adding child documents, it works perfectly and the child documents are added successfully. But when I have both of them in the same SolrInputDocument, adding the child documents does not work. I think the solr.add(document) method cannot handle having both an incremental partial update and a child document in the same Solr document, so it is probably a bug in SolrJ. Would you please consider this situation? Thank you very much. Best regards. -- A.Nazemian
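For clarity, a rough sketch of the combination described above, with made-up URL, core, and field names: an atomic "set" update and an added child document in the same SolrInputDocument, sent through a single solr.add(document) call:

```java
import java.util.Collections;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class PartialUpdatePlusChild {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "12345");
        // Incremental (atomic) partial update of one field.
        doc.addField("read_flag", Collections.singletonMap("set", "true"));

        // Child document added to the very same parent document.
        SolrInputDocument child = new SolrInputDocument();
        child.addField("id", "12345-child-1");
        child.addField("type", "child");
        doc.addChildDocument(child);

        // This is the call that, in the scenario described above, silently drops the child.
        solr.add(doc);
        solr.commit();
        solr.shutdown();
    }
}
```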
Re: solrJ bug related to solrJ 4.10 for having both incremental partial update and child document on the same solr document!
I also checked both the Solr log and the Solr console. There is no error there; it seems that everything is fine! But there is not any child document after the process executes. On Mon, Sep 29, 2014 at 1:47 PM, Ali Nazemian wrote: > Dear all, > Hi, > Right now I face with the strange problem related to solJ client: > When I use only incremental partial update. The incremental partial update > works fine. When I use only the add child documents. It works perfectly and > the child documents added successfully. But when I have both of them in > solrInputDocument the adding child documents did not work. I think that the > solr.add(document) method can not work when you have both incremental > partial update and child document in your solr document. So probably it is > a bug related to solj. Would you please consider this situation? > Thank you very much. > Best regards. > > -- > A.Nazemian > -- A.Nazemian
Re: boosting words from specific list
Dear Koji, Hi, Thank you very much. Do you know any example code for UpdateRequestProcessor? Anything would be appreciated. Best regards. On Tue, Sep 30, 2014 at 3:41 AM, Koji Sekiguchi wrote: > Hi Ali, > > I don't think Solr has such function OOTB. One way I can think of is that > you can implement UpdateRequestProcessor. In processAdd() method of > the UpdateRequestProcessor, as you can read field values, you can calculate > the total score and copy the total score to a field e.g. total_score. > Then you can sort the query result on total_score field when you query. > > Koji > -- > http://soleami.com/blog/comparing-document-classification-functions-of- > lucene-and-mahout.html > > > (2014/09/29 4:25), Ali Nazemian wrote: > >> Dear all, >> Hi, >> I was wondering how can I implement solr boosting words from specific list >> of important words? I mean I want to have a list of important words and >> tell solr to score documents based on the weighted sum of these words. For >> example let word "school" has weight of 2 and word "president" has the >> weight of 5. In this case a doc with 2 "school" words and 3 "president" >> words will has the total score of 19! I want to sort documents based on >> this score. How such procedure is possible in solr? Thank you very much. >> Best regards. >> >> > > > -- A.Nazemian
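Not official example code, but a rough sketch along the lines Koji describes, with made-up weights, a made-up "content" source field, and a "total_score" target field. The factory would still have to be wired into an update chain in solrconfig.xml, and a real implementation would probably reuse the field's analyzer instead of the naive whitespace split shown here:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class WeightedScoreUpdateProcessorFactory extends UpdateRequestProcessorFactory {

    @Override
    public UpdateRequestProcessor getInstance(SolrQueryRequest req,
                                              SolrQueryResponse rsp,
                                              UpdateRequestProcessor next) {
        return new UpdateRequestProcessor(next) {
            // Hard-coded example weights; in practice they could come from init args or a file.
            private final Map<String, Integer> weights = new HashMap<String, Integer>() {{
                put("school", 2);
                put("president", 5);
            }};

            @Override
            public void processAdd(AddUpdateCommand cmd) throws IOException {
                SolrInputDocument doc = cmd.getSolrInputDocument();
                Object content = doc.getFieldValue("content");
                long total = 0;
                if (content != null) {
                    // Naive whitespace tokenization just for illustration.
                    for (String token : content.toString().toLowerCase().split("\\s+")) {
                        Integer w = weights.get(token);
                        if (w != null) {
                            total += w;
                        }
                    }
                }
                doc.setField("total_score", total);
                super.processAdd(cmd); // hand the document to the rest of the chain
            }
        };
    }
}
```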
Re: boosting words from specific list
Dear Koji, Also, would you please tell me how I can access the term frequency for each word? Should I do a word count on the content, or is it possible to access the inverted-index information to make the process more efficient? I don't want to add too much overhead to the indexing time. On Tue, Sep 30, 2014 at 7:07 PM, Ali Nazemian wrote: > Dear Koji, > Hi, > Thank you very much. > Do you know any example code for UpdateRequestProcessor? Anything would be > appreciated. > Best regards. > > On Tue, Sep 30, 2014 at 3:41 AM, Koji Sekiguchi > wrote: > >> Hi Ali, >> >> I don't think Solr has such function OOTB. One way I can think of is that >> you can implement UpdateRequestProcessor. In processAdd() method of >> the UpdateRequestProcessor, as you can read field values, you can >> calculate >> the total score and copy the total score to a field e.g. total_score. >> Then you can sort the query result on total_score field when you query. >> >> Koji >> -- >> http://soleami.com/blog/comparing-document-classification-functions-of- >> lucene-and-mahout.html >> >> >> (2014/09/29 4:25), Ali Nazemian wrote: >> >>> Dear all, >>> Hi, >>> I was wondering how can I implement solr boosting words from specific >>> list >>> of important words? I mean I want to have a list of important words and >>> tell solr to score documents based on the weighted sum of these words. >>> For >>> example let word "school" has weight of 2 and word "president" has the >>> weight of 5. In this case a doc with 2 "school" words and 3 "president" >>> words will has the total score of 19! I want to sort documents based on >>> this score. How such procedure is possible in solr? Thank you very much. >>> Best regards. >>> >>> >> >> >> > > > -- > A.Nazemian > -- A.Nazemian
Re: solrJ bug related to solrJ 4.10 for having both incremental partial update and child document on the same solr document!
Did anybody test that? Best regards. On Mon, Sep 29, 2014 at 2:05 PM, Ali Nazemian wrote: > I also check both solr log and solr console. There is no error inside > that, it seems that every thing is fine! But actually there is not any > child document after executing process. > > > On Mon, Sep 29, 2014 at 1:47 PM, Ali Nazemian > wrote: > >> Dear all, >> Hi, >> Right now I face with the strange problem related to solJ client: >> When I use only incremental partial update. The incremental partial >> update works fine. When I use only the add child documents. It works >> perfectly and the child documents added successfully. But when I have both >> of them in solrInputDocument the adding child documents did not work. I >> think that the solr.add(document) method can not work when you have both >> incremental partial update and child document in your solr document. So >> probably it is a bug related to solj. Would you please consider this >> situation? >> Thank you very much. >> Best regards. >> >> -- >> A.Nazemian >> > > > > -- > A.Nazemian > -- A.Nazemian
duplicate unique key after partial update in solr 4.10
Dear all, Hi, I am doing a partial update on a field that does not have any value yet. Suppose I have a document with document id (unique key) '12345' and a field "read_flag" which was not indexed in the first place, so the read_flag field for this document has no value. After I did a partial update on this document to set "read_flag"="true", I faced a strange problem: the next time I indexed the same document with the same values, I saw two different versions of the document with id '12345' in Solr, one of them with read_flag=true and another one without the read_flag field! I do not want duplicate documents (and there should not be any, because of the unique key id). Would you please tell me what causes such a problem? Best regards. -- A.Nazemian
Re: duplicate unique key after partial update in solr 4.10
Dear Alex, Hi, LOL, yeah I am sure. You can test it yourself. I did that on default schema too. The results are same! Regards. On Mon, Oct 6, 2014 at 4:20 PM, Alexandre Rafalovitch wrote: > A stupid question: Are you sure that what schema thinks your uniqueId > is - is the uniqueId in your setup? Also, that you are not somehow > using the flags to tell Solr to ignore duplicates? > > Regards, >Alex. > Personal: http://www.outerthoughts.com/ and @arafalov > Solr resources and newsletter: http://www.solr-start.com/ and @solrstart > Solr popularizers community: https://www.linkedin.com/groups?gid=6713853 > > > On 6 October 2014 03:40, Ali Nazemian wrote: > > Dear all, > > Hi, > > I am going to do partial update on a field that has not any value. > Suppose > > I have a document with document id (unique key) '12345' and field > > "read_flag" which does not index at the first place. So the read_flag > field > > for this document has not any value. After I did partial update to this > > document to set "read_flag"="true", I faced strange problem. Next time I > > indexed same document with same values I saw two different version of > > document with id '12345' in solr. One of them with read_flag=true and > > another one without read_flag field! I dont want to have duplicate > > documents (as it should not to be because of unique_key id). Would you > > please tell me what caused such problem? > > Best regards. > > > > -- > > A.Nazemian > -- A.Nazemian
Re: duplicate unique key after partial update in solr 4.10
The list of docs before doing the partial update:
product01 | car | product
part01 | wheels | part
part02 | engine | part
part03 | brakes | part
product02 | truck | product
part04 | wheels | part
part05 | flaps | part
The list of docs after doing the partial update of field read_flag for document "product01":
product01 | car | product | read_flag=true
part01 | wheels | part
part02 | engine | part
part03 | brakes | part
product02 | truck | product
part04 | wheels | part
part05 | flaps | part
The list of documents after sending the same documents again (it should overwrite the last one because of the duplicate IDs):
product01 | car | product | read_flag=true
product01 | car | product
part01 | wheels | part
part02 | engine | part
part03 | brakes | part
product02 | truck | product
part04 | wheels | part
part05 | flaps | part
But as you can see there are two different versions of the document with the same ID (which is product01). Regards. On Mon, Oct 6, 2014 at 8:18 PM, Alexandre Rafalovitch wrote: > Can you upload the update documents then (into a Gist or similar). > Just so that people didn't have to re-imagine exact steps. Because, if > it fully checks out, it might be a bug and the next step would be > creating a JIRA ticket. > > Regards, >Alex. > Personal: http://www.outerthoughts.com/ and @arafalov > Solr resources and newsletter: http://www.solr-start.com/ and @solrstart > Solr popularizers community: https://www.linkedin.com/groups?gid=6713853 > > > On 6 October 2014 11:23, Ali Nazemian wrote: > > Dear Alex, > > Hi, > > LOL, yeah I am sure. You can test it yourself. I did that on default > schema > > too. The results are same! > > Regards. > > > > On Mon, Oct 6, 2014 at 4:20 PM, Alexandre Rafalovitch < > arafa...@gmail.com> > > wrote: > > > >> A stupid question: Are you sure that what schema thinks your uniqueId > >> is - is the uniqueId in your setup? Also, that you are not somehow > >> using the flags to tell Solr to ignore duplicates? > >> > >> Regards, > >>Alex. > >> Personal: http://www.outerthoughts.com/ and @arafalov > >> Solr resources and newsletter: http://www.solr-start.com/ and > @solrstart > >> Solr popularizers community: > https://www.linkedin.com/groups?gid=6713853 > >> > >> > >> On 6 October 2014 03:40, Ali Nazemian wrote: > >> > Dear all, > >> > Hi, > >> > I am going to do partial update on a field that has not any value. > >> Suppose > >> > I have a document with document id (unique key) '12345' and field > >> > "read_flag" which does not index at the first place. So the read_flag > >> field > >> > for this document has not any value. After I did partial update to > this > >> > document to set "read_flag"="true", I faced strange problem. Next > time I > >> > indexed same document with same values I saw two different version of > >> > document with id '12345' in solr. One of them with read_flag=true and > >> > another one without read_flag field! I dont want to have duplicate > >> > documents (as it should not to be because of unique_key id). Would you > >> > please tell me what caused such problem? > >> > Best regards. > >> > > >> > -- > >> > A.Nazemian > >> > > > > > > > > -- > > A.Nazemian > -- A.Nazemian
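A rough SolrJ sketch of the three steps above, under the assumption that the product/part documents are indexed as parent/child blocks (ids and field values taken from the listing; the helper methods and URL are made up for brevity):

```java
import java.util.Collections;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class DuplicateIdRepro {
    static SolrInputDocument doc(String id, String name, String type) {
        SolrInputDocument d = new SolrInputDocument();
        d.addField("id", id);
        d.addField("name", name);
        d.addField("type", type);
        return d;
    }

    static SolrInputDocument productBlock() {
        SolrInputDocument product = doc("product01", "car", "product");
        product.addChildDocument(doc("part01", "wheels", "part"));
        product.addChildDocument(doc("part02", "engine", "part"));
        product.addChildDocument(doc("part03", "brakes", "part"));
        return product;
    }

    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // Step 1: index the parent with its children as one block.
        solr.add(productBlock());
        solr.commit();

        // Step 2: atomic (partial) update of read_flag on the parent only.
        SolrInputDocument update = new SolrInputDocument();
        update.addField("id", "product01");
        update.addField("read_flag", Collections.singletonMap("set", "true"));
        solr.add(update);
        solr.commit();

        // Step 3: re-send the same block; in the report above, a second product01
        // (without read_flag) then shows up next to the updated one instead of overwriting it.
        solr.add(productBlock());
        solr.commit();
        solr.shutdown();
    }
}
```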
import solr source to eclipse
Hi, I am going to import the Solr source code into Eclipse for some development purposes. Unfortunately, every tutorial I found for this purpose is outdated and did not work. So would you please give me some hints on how I can import the Solr source code into Eclipse? Thank you very much. -- A.Nazemian
Re: import solr source to eclipse
Thank you very much for your guidance, but how can I run the Solr server inside Eclipse? Best regards. On Mon, Oct 13, 2014 at 8:02 PM, Rajani Maski wrote: > Hi, > > The best tutorial for setting up Solr[solr 4.7] in eclipse/intellij is > documented in Solr In Action book, Appendix A, *Working with the Solr > codebase* > > > On Mon, Oct 13, 2014 at 6:45 AM, Tomás Fernández Löbbe < > tomasflo...@gmail.com> wrote: > > > The way I do this: > > From a terminal: > > svn checkout https://svn.apache.org/repos/asf/lucene/dev/trunk/ > > lucene-solr-trunk > > cd lucene-solr-trunk > > ant eclipse > > > > ... And then, from your Eclipse "import existing java project", and > select > > the directory where you placed lucene-solr-trunk > > > > On Sun, Oct 12, 2014 at 7:09 AM, Ali Nazemian > > wrote: > > > > > Hi, > > > I am going to import solr source code to eclipse for some development > > > purpose. Unfortunately every tutorial that I found for this purpose is > > > outdated and did not work. So would you please give me some hint about > > how > > > can I import solr source code to eclipse? > > > Thank you very much. > > > > > > -- > > > A.Nazemian > > > > > > -- A.Nazemian
mark solr documents as duplicates on hashing the combination of some fields
Dear all, Hi, I was wondering how I can mark some documents as duplicates (just marking them for future use, not deleting them) based on a hash of the combination of some fields? Suppose I have two fields named "url" and "title"; I want to create a hash based on url+title and send it to another field named "signature". If I do that using Solr dedup, it results in deleting the duplicate documents! So it is not applicable to my situation. Thank you very much. Best regards. -- A.Nazemian
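One possible client-side workaround sketch (not Solr's dedup handler): compute a hash over url+title in the SolrJ client and write it into a separate "signature" field, so nothing is ever deleted. The MD5 choice, URL, and field names are just assumptions:

```java
import java.math.BigInteger;
import java.security.MessageDigest;

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class ClientSideSignature {
    static String md5(String text) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5").digest(text.getBytes("UTF-8"));
        return String.format("%032x", new BigInteger(1, digest));
    }

    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("url", "http://example.com/a");
        doc.addField("title", "Example title");
        // Documents sharing the same signature can later be found by faceting or grouping on it.
        doc.addField("signature", md5(doc.getFieldValue("url") + "|" + doc.getFieldValue("title")));

        solr.add(doc);
        solr.commit();
        solr.shutdown();
    }
}
```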
having Solr deduplication and partial update
Hi, I was wondering how I can have both Solr deduplication and partial updates. I found out that, for some reason, you cannot rely on Solr deduplication when you try to update a document partially! It seems that when you do a partial update on some field, even if that field is not one of the signature fields, the Solr signature created by deduplication becomes invalid! Is there any way I can have both deduplication and partial updates? Thank you very much. -- A.Nazemian
Re: mark solr documents as duplicates on hashing the combination of some fields
The problem is that when I partially update some fields of a document, the signature becomes useless, even if the updated fields are not included in the signatureField! Regards. On Wed, Oct 22, 2014 at 12:44 AM, Chris Hostetter wrote: > > you can still use the SignatureUpdateProcessorFactory for your usecase, > just don't configure the signatureField to be the same as your uniqueKey > field. > > configure some other fieldname (ie "signature") instead. > > > : Date: Tue, 14 Oct 2014 12:08:26 +0330 > : From: Ali Nazemian > : Reply-To: solr-user@lucene.apache.org > : To: "solr-user@lucene.apache.org" > : Subject: mark solr documents as duplicates on hashing the combination of some > : fields > : > : Dear all, > : Hi, > : I was wondering how can I mark some documents as duplicate (just marking > : for future usage not deleting) based on the hash combination of some > : fields? Suppose I have 2 fields name "url" and "title" I want to create > : hash based on url+title and send it to another field name "signature". > If I > : do that using solr dedup, it will be resulted to deleting duplicate > : documents! So it is not applicable for my situation. Thank you very much. > : Best regards. > : > : -- > : A.Nazemian > : > > -Hoss > http://www.lucidworks.com/
Re: mark solr documents as duplicates on hashing the combination of some fields
I meant the signature will be broken. For example, suppose the destination field of the hash function for the signature fields is "sig"; after each partial update it becomes "00"! On Wed, Oct 22, 2014 at 2:59 PM, Alexandre Rafalovitch wrote: > What do you mean by 'useless' specifically on the business level? > > Regards, > Alex > On 22/10/2014 7:27 am, "Ali Nazemian" wrote: > > > The problem is when I partially update some fields of document. The > > signature becomes useless! Even if the updated fields are not included in > > the signatureField! > > Regards. > > > > On Wed, Oct 22, 2014 at 12:44 AM, Chris Hostetter < > > hossman_luc...@fucit.org> > > wrote: > > > > > > > > you can still use the SignatureUpdateProcessorFactory for your usecase, > > > just don't configure teh signatureField to be the same as your > uniqueKey > > > field. > > > > > > configure some othe fieldname (ie "signature") instead. > > > > > > > > > : Date: Tue, 14 Oct 2014 12:08:26 +0330 > > > : From: Ali Nazemian > > > : Reply-To: solr-user@lucene.apache.org > > > : To: "solr-user@lucene.apache.org" > > > : Subject: mark solr documents as duplicates on hashing the combination > > of > > > some > > > : fields > > > : > > > : Dear all, > > > : Hi, > > > : I was wondering how can I mark some documents as duplicate (just > > marking > > > : for future usage not deleting) based on the hash combination of some > > > : fields? Suppose I have 2 fields name "url" and "title" I want to > create > > > : hash based on url+title and send it to another field name > "signature". > > > If I > > > : do that using solr dedup, it will be resulted to deleting duplicate > > > : documents! So it is not applicable for my situation. Thank you very > > much. > > > : Best regards. > > > : > > > : -- > > > : A.Nazemian > > > : > > > > > > -Hoss > > > http://www.lucidworks.com/ > > > > > > > > > > > -- > > A.Nazemian > > > -- A.Nazemian
Hardware requirement for 500 million documents
Hi, I was wondering what the hardware requirements are for indexing 500 million documents in Solr? Suppose the maximum number of concurrent users at peak time would be 20. Thank you very much. -- A.Nazemian
Extending solr analysis in index time
Hi everybody, I am going to add some analysis to Solr at index time. Here is what I have in mind: suppose I have two different fields in the Solr schema, field "a" and field "b". I am going to use the inverted index in a way that some terms are considered important ones, and tell Lucene to calculate a value based on those terms' frequencies per document. For example, let the word "hello" be considered an important word with a weight of 2.0, and suppose the term frequency for this word in field "a" is 3 and in field "b" is 6 for document 1. Therefore the score value would be 2*3+(2*6)^2. I want to calculate this score based on these fields and put it in the index for retrieval. My question is how I can do such a thing? At first I considered using the terms component to calculate this value from outside and put it back into the Solr index, but it does not seem efficient enough. Thank you very much. Best regards. -- A.Nazemian
Re: Extending solr analysis in index time
Dear Jack, Hi, I think you misunderstood my need. I don't want to change the default scoring behavior of Lucene (tf-idf); I just want to have another field to sort on for some specific queries (not for all the search business). However, I am aware of Lucene payloads. Thank you very much. On Sun, Jan 11, 2015 at 7:15 PM, Jack Krupansky wrote: > You would do that with a custom similarity (scoring) class. That's an > expert feature. In fact a SUPER-expert feature. > > Start by completely familiarizing yourself with how TF*IDF similarity > already works: > > http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html > > And to use your custom similarity class in Solr: > > https://cwiki.apache.org/confluence/display/solr/Other+Schema+Elements#OtherSchemaElements-Similarity > > > -- Jack Krupansky > > On Sun, Jan 11, 2015 at 9:04 AM, Ali Nazemian > wrote: > > > Hi everybody, > > > > I am going to add some analysis to Solr at the index time. Here is what I > > am considering in my mind: > > Suppose I have two different fields for Solr schema, field "a" and field > > "b". I am going to use the created reverse index in a way that some terms > > are considered as important ones and tell lucene to calculate a value > based > > on these terms frequency per each document. For example let the word > > "hello" considered as important word with the weight of "2.0". Suppose > the > > term frequency for this word at field "a" is 3 and at field "b" is 6 for > > document 1. Therefor the score value would be 2*3+(2*6)^2. I want to > > calculate this score based on these fields and put it in the index for > > retrieving. My question would be how can I do such thing? First I did > > consider using term component for calculating this value from outside and > > put it back to Solr index, but it seems it is not efficient enough. > > > > Thank you very much. > > Best regards. > > > > -- > > A.Nazemian > > > -- A.Nazemian
Re: Extending solr analysis in index time
Dear Alexandre, I have not tried UpdateRequestProcessor yet. Can I access term frequencies at that level? I don't want to calculate term frequencies once more when Lucene already calculates them in the inverted index. Thank you very much. On Jan 11, 2015 7:49 PM, "Alexandre Rafalovitch" wrote: > Your description uses the terms Solr/Lucene uses but perhaps not in > the same way we do. That might explain the confusion. > > It sounds - on a high level - that you want to create a field based on > a combination of a couple of other fields during indexing stage. Have > you tried UpdateRequestProcessors? They have access to the full > document when it is sent and can do whatever they want with it. > > Regards, >Alex. > > Sign up for my Solr resources newsletter at http://www.solr-start.com/ > > > On 11 January 2015 at 10:55, Ali Nazemian wrote: > > Dear Jack, > > Hi, > > I think you misunderstood my need. I dont want to change the default > > scoring behavior of Lucene (tf-idf) I just want to have another field to > do > > sorting for some specific queries (not all the search business), however > I > > am aware of Lucene payload. > > Thank you very much. > > > > On Sun, Jan 11, 2015 at 7:15 PM, Jack Krupansky < > jack.krupan...@gmail.com> > > wrote: > > > >> You would do that with a custom similarity (scoring) class. That's an > >> expert feature. In fact a SUPER-expert feature. > >> > >> Start by completely familiarizing yourself with how TF*IDF similarity > >> already works: > >> > >> > http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html > >> > >> And to use your custom similarity class in Solr: > >> > >> > https://cwiki.apache.org/confluence/display/solr/Other+Schema+Elements#OtherSchemaElements-Similarity > >> > >> > >> -- Jack Krupansky > >> > >> On Sun, Jan 11, 2015 at 9:04 AM, Ali Nazemian > >> wrote: > >> > >> > Hi everybody, > >> > > >> > I am going to add some analysis to Solr at the index time. Here is > what I > >> > am considering in my mind: > >> > Suppose I have two different fields for Solr schema, field "a" and > field > >> > "b". I am going to use the created reverse index in a way that some > terms > >> > are considered as important ones and tell lucene to calculate a value > >> based > >> > on these terms frequency per each document. For example let the word > >> > "hello" considered as important word with the weight of "2.0". Suppose > >> the > >> > term frequency for this word at field "a" is 3 and at field "b" is 6 > for > >> > document 1. Therefor the score value would be 2*3+(2*6)^2. I want to > >> > calculate this score based on these fields and put it in the index for > >> > retrieving. My question would be how can I do such thing? First I did > >> > consider using term component for calculating this value from outside > and > >> > put it back to Solr index, but it seems it is not efficient enough. > >> > > >> > Thank you very much. > >> > Best regards. > >> > > >> > -- > >> > A.Nazemian > >> > > >> > > > > > > > > -- > > A.Nazemian >
Re: Extending solr analysis in index time
Dear Jack, Thank you very much. Yes, I was thinking of a function query for sorting, but I have two problems in this case: 1) a function query does the computation at query time, which I want to avoid; 2) I also want the score field to be retrievable so I can show it to users. Dear Alexandre, Here is some more explanation of the business behind the question: I am going to provide a field for each document, let's refer to it as "document_score". I am going to fill this field based on information that can be extracted from the Lucene inverted index. Assume I have a list of terms, called important terms, and I am going to extract the term frequency of each term in this list per document. I want to use those term frequencies for calculating "document_score". "document_score" should be storable since I am going to retrieve this field for each document. I also want to be able to sort on "document_score" when the user prefers it. I hope I have conveyed my point. Best regards. On Mon, Jan 12, 2015 at 12:53 AM, Jack Krupansky wrote: > Won't function queries do the job at query time? You can add or multiply > the tf*idf score by a function of the term frequency of arbitrary terms, > using the tf, mul, and add functions. > > See: > https://cwiki.apache.org/confluence/display/solr/Function+Queries > > -- Jack Krupansky > > On Sun, Jan 11, 2015 at 10:55 AM, Ali Nazemian > wrote: > > > Dear Jack, > > Hi, > > I think you misunderstood my need. I dont want to change the default > > scoring behavior of Lucene (tf-idf) I just want to have another field to > do > > sorting for some specific queries (not all the search business), however > I > > am aware of Lucene payload. > > Thank you very much. > > > > On Sun, Jan 11, 2015 at 7:15 PM, Jack Krupansky < > jack.krupan...@gmail.com> > > wrote: > > > > > You would do that with a custom similarity (scoring) class. That's an > > > expert feature. In fact a SUPER-expert feature. > > > > > > Start by completely familiarizing yourself with how TF*IDF similarity > > > already works: > > > > > > > > > http://lucene.apache.org/core/4_10_3/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html > > > > > > And to use your custom similarity class in Solr: > > > > > > > > > https://cwiki.apache.org/confluence/display/solr/Other+Schema+Elements#OtherSchemaElements-Similarity > > > > > > > > > -- Jack Krupansky > > > > > > On Sun, Jan 11, 2015 at 9:04 AM, Ali Nazemian > > > wrote: > > > > > > > Hi everybody, > > > > > > > > I am going to add some analysis to Solr at the index time. Here is > > what I > > > > am considering in my mind: > > > > Suppose I have two different fields for Solr schema, field "a" and > > field > > > > "b". I am going to use the created reverse index in a way that some > > terms > > > > are considered as important ones and tell lucene to calculate a value > > > based > > > > on these terms frequency per each document. For example let the word > > > > "hello" considered as important word with the weight of "2.0". > Suppose > > > the > > > > term frequency for this word at field "a" is 3 and at field "b" is 6 > > for > > > > document 1. Therefor the score value would be 2*3+(2*6)^2. I want to > > > > calculate this score based on these fields and put it in the index > for > > > > retrieving. My question would be how can I do such thing? First I did > > > > consider using term component for calculating this value from outside > > and > > > > put it back to Solr index, but it seems it is not efficient enough. 
> > > > > > > > Thank you very much. > > > > Best regards. > > > > > > > > -- > > > > A.Nazemian > > > > > > > > > > > > > > > -- > > A.Nazemian > > > -- A.Nazemian
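For the record, a sketch of what Jack's function-query suggestion could look like from SolrJ for the 2*3+(2*6)^2 example in this thread, using the built-in tf, mul, sum, and pow functions. The field names "a" and "b" and the term "hello" are the made-up ones from the thread, and note that this computes the value per query rather than storing it in the index:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class FunctionQuerySort {
    public static void main(String[] args) throws Exception {
        // Made-up URL and core name.
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        // weight*tf(a,hello) + (weight*tf(b,hello))^2 with weight = 2,
        // mirroring the 2*3+(2*6)^2 example from the thread.
        String score = "sum(mul(2,tf(a,'hello')),pow(mul(2,tf(b,'hello')),2))";

        SolrQuery q = new SolrQuery("*:*");
        q.setSort(score, SolrQuery.ORDER.desc);   // sort by the computed value
        q.setFields("id", "docScore:" + score);   // also return it as a pseudo-field
        q.setRows(10);

        QueryResponse rsp = solr.query(q);
        for (SolrDocument d : rsp.getResults()) {
            System.out.println(d.getFieldValue("id") + " -> " + d.getFieldValue("docScore"));
        }
        solr.shutdown();
    }
}
```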