Re: dynamic field sorting

2017-03-22 Thread Mikhail Khludnev
Since sorting on these fields hits the heap, moving to docValues might make sense.
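A minimal sketch of that change, assuming a string-typed dynamic field used only for sorting (the field and type names are placeholders):

   <dynamicField name="*_sort" type="string" indexed="false" stored="false" docValues="true"/>

With docValues enabled, sorting reads the column-oriented on-disk structure instead of un-inverting the field onto the heap.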

On Wed, Mar 22, 2017 at 7:47 AM, Midas A  wrote:

> waiting for a reply. Actually, heap utilization increases when we sort with
> dynamic fields
>
> On Tue, Mar 21, 2017 at 10:37 AM, Midas A  wrote:
>
> > Hi ,
> >
> > How can I improve the performance of dynamic field sorting?
> >
> > index size is : 20 GB
> >
> > Regards,
> > Midas
> >
>



-- 
Sincerely yours
Mikhail Khludnev


Re: dataimport to a smaller Solr farm

2017-03-22 Thread Mikhail Khludnev
Hello, Dean.

DIH is shard agnostic. How do you try to specify "a shard from the new
collection"?
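For context, a data-config that copies from another collection normally points SolrEntityProcessor at the source collection as a whole and lets the target collection route documents itself; everything below (host, collection, row count) is a placeholder sketch, not the actual config from this thread:

   <dataConfig>
     <document>
       <entity name="source"
               processor="SolrEntityProcessor"
               url="http://source-host:8983/solr/old_collection"
               query="*:*"
               rows="500"
               fl="*"/>
     </document>
   </dataConfig>

With that shape, the number of shards in the destination shouldn't matter, since Solr's own distributed update processing does the routing.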

On Tue, Mar 21, 2017 at 8:24 PM, deansg  wrote:

> Hello,
> My team often uses the /dataimport & /dih handlers to move items from one
> Solr collection to another. However, all the times we did that, the number
> of shards in the new collection was always the same or higher than in the
> old.
> Can /dataimport work if I have fewer shards in the new collection than in
> the
> old one? I tried specifying a shard from the new collection multiple times
> in the data-config file, and it didn't seem to work - there were no visible
> exceptions, but most items simply didn't enter the new collection.
> Dean.
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/dataimport-to-a-smaller-Solr-farm-tp4326067.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Sincerely yours
Mikhail Khludnev


RE: Exception while integrating openNLP with Solr

2017-03-22 Thread Markus Jelsma
Hello - there is an underlying AIOoBE (ArrayIndexOutOfBoundsException) causing you trouble:

at java.lang.Thread.run(Thread.java:745)
*Caused by: java.lang.ArrayIndexOutOfBoundsException: 1*
at
opennlp.tools.lemmatizer.SimpleLemmatizer.<init>(SimpleLemmatizer.java:46)
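That constructor reads the lemmatizer dictionary line by line, so an index-out-of-bounds at position 1 usually points at a line that does not have the expected number of columns. If I remember the format correctly (an assumption worth checking against your file), each line is tab-separated word, POS tag and lemma, for example:

   abandoned	JJ	abandoned
   abandoning	VBG	abandon

A stray blank or wrapped line in en-lemmatizer.txt would break exactly there.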

Regards,
Markus
 
-Original message-
> From:aruninfo100 
> Sent: Wednesday 22nd March 2017 1:33
> To: solr-user@lucene.apache.org
> Subject: Exception while integrating openNLP with Solr
> 
> Hi,
> 
> I am trying to integrate openNLP with Solr.
> 
> The fieldtype is :
> 
> <fieldType name="..." class="..." positionIncrementGap="100">
>   <analyzer>
>     <tokenizer class="..." sentenceModel="opennlp/en-sent.bin" tokenizerModel="opennlp/en-token.bin"/>
>     <filter class="..." posTaggerModel="opennlp/en-pos-maxent.bin"/>
>     <filter class="..." dictionary="opennlp/en-lemmatizer.txt"/>
>   </analyzer>
> </fieldType>
> 
> en-lemmatizer.txt->The file has a size close to 5mb.
> I am using the lemmatizer dictionary from below link:
> 
> https://raw.githubusercontent.com/richardwilly98/elasticsearch-opennlp-auto-tagging/master/src/main/resources/models/en-lemmatizer.dict
> 
>   
> field schema:
> 
> 
> 
> When I try to index I get the following error:
> 
> error :Error from server at http://localhost:8983/solr/star: Exception
> writing document id 578df0de-6adc-4ca2-9d5d-23c5c088f83a to the index;
> possible analysis error.
> 
> solr.log:
> 
> 
> 2017-03-22 00:03:42.477 INFO  (qtp1389647288-14) [   x:star]
> o.a.s.u.p.LogUpdateProcessorFactory [star]  webapp=/solr path=/update
> params={wt=javabin&version=2}{} 0 116
> 2017-03-22 00:03:42.478 ERROR (qtp1389647288-14) [   x:star]
> o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Exception
> writing document id 303e190b-b02c-4927-9669-733e76164f61 to the index;
> possible analysis error.
>   at
> org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:181)
>   at
> org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:68)
>   at
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>   at
> org.apache.solr.update.processor.AddSchemaFieldsUpdateProcessorFactory$AddSchemaFieldsUpdateProcessor.processAdd(AddSchemaFieldsUpdateProcessorFactory.java:335)
>   at
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>   at
> org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117)
>   at
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>   at
> org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117)
>   at
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>   at
> org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117)
>   at
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>   at
> org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117)
>   at
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>   at
> org.apache.solr.update.processor.FieldNameMutatingUpdateProcessorFactory$1.processAdd(FieldNameMutatingUpdateProcessorFactory.java:74)
>   at
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>   at
> org.apache.solr.update.processor.FieldMutatingUpdateProcessor.processAdd(FieldMutatingUpdateProcessor.java:117)
>   at
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>   at
> org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:939)
>   at
> org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1094)
>   at
> org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:720)
>   at
> org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
>   at
> org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
>   at
> org.apache.solr.update.processor.AbstractDefaultValueUpdateProcessorFactory$DefaultValueUpdateProcessor.processAdd(AbstractDefaultValueUpdateProcessorFactory.java:93)
>   at
> org.apache.solr.handler.loader.JavabinLoader$1.update(JavabinLoader.java:97)
>   at
> org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:179)
>   at
> org.apache.solr.client.solrj.request.JavaBin

RE: Exception while integrating openNLP with Solr

2017-03-22 Thread aruninfo100
Hi,

I was able to resolve the issue. But when I run the indexing process it is
taking too long to index bigger documents, and sometimes I get a Java heap
memory exception.
How can I improve the performance while using dictionary lemmatizers?

Thanks and Regards,
Arun



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Exception-while-integrating-openNLP-with-Solr-tp4326146p4326197.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Exception while integrating openNLP with Solr

2017-03-22 Thread Markus Jelsma
Hello - you need to increase the heap to work around the out of memory
exception. There is not much you can do to increase the indexing speed using
OpenNLP.
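For example (the value is only illustrative), the heap can be raised on the command line or in solr.in.sh:

   bin/solr start -m 4g

   # or, in solr.in.sh / solr.in.cmd
   SOLR_HEAP="4g"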

Regards,
Markus
 
-Original message-
> From:aruninfo100 
> Sent: Wednesday 22nd March 2017 12:27
> To: solr-user@lucene.apache.org
> Subject: RE: Exception while integrating openNLP with Solr
> 
> Hi,
> 
> I was able to resolve the issue.But when I run the indexing process it is
> taking too long to index bigger documents and some times I get java heap
> memory exception.
> How can I improve the performance while using dictionary lemmmatizers.
> 
> Thanks and Regards,
> Arun
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Exception-while-integrating-openNLP-with-Solr-tp4326146p4326197.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 


Re: Stored value for highlighting from different field?

2017-03-22 Thread Matthew Caruana Galizia
An ICIJ engineer, Julien Martin, has since developed a patch for this. We’d 
appreciate any feedback and attention that might help get this integrated: 
https://issues.apache.org/jira/browse/SOLR-1105 


> On 1 Mar 2017, at 17:03, Matthew Caruana Galizia wrote:
> 
> We’re currently using copyField directives in our schema to copy the same 
> text to different fields that use different analysers. For example, assuming 
> the original field contained in the document payload sent to the update 
> handler is called “tika_output", it is copied to “text”, 
> “text_case_sensitive” and “text_accent_sensitive”.
> 
> In order to avoid inflating the size of the index, “tika_output" has 
> indexed=false and stored=true, while “text” and friends have indexed=true and 
> stored=false.
> 
> We’re using the unified highlighter. I’ve read the code in 
> UnifiedHighlighter.java, which clearly shows that the field to be highlighted 
> must be stored. Therefore, searching on text_case_sensitive doesn’t yield 
> highlighted results. Storing the field value redundantly would mean tripling 
> my storage costs.
> 
> I see that other people have brought up this issue before:
> 
> https://issues.apache.org/jira/browse/SOLR-1105 
> 
> https://issues.apache.org/jira/browse/SOLR-5276 
> 
> 
> Is there anything that can be done? If it comes down to subclassing the 
> unified highlighter, does anyone have any recommendations for doing this?
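For readers following along, the setup described above amounts to roughly the following schema arrangement (field names are taken from the message, the types are assumed); the fields actually being searched and highlighted have no stored text of their own:

   <field name="tika_output" type="text_general" indexed="false" stored="true"/>
   <field name="text" type="text_en" indexed="true" stored="false"/>
   <field name="text_case_sensitive" type="text_en_cs" indexed="true" stored="false"/>
   <!-- and similarly for text_accent_sensitive -->

   <copyField source="tika_output" dest="text"/>
   <copyField source="tika_output" dest="text_case_sensitive"/>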





SolrJ getHighlighting() does not return results in order

2017-03-22 Thread leoperezpulido
Hi,

Implementing highlighting with *SolrJ* does not return results in the proper
order while I "page" through results. This does not seem to be a problem with
the RESTful API.

// ...
query.setQuery("text");
/*
The problem is when I set start to get different "pages",
the results returned by getHighlighting() are disordered.
*/
query.setStart(0);
query.setSort("score", SolrQuery.ORDER.desc);
query.setIncludeScore(true);

query.setHighlight(true);
query.addHighlightField("content");
// ...

Take the example of a simple index with a field named content and field's
values like:
Document 1
Document 2
Document 3
etc.

With the results returned by SolrDocumentList and with the RESTful API, I
can paginate in the normal way, and the results remain ordered. This is not
the case when I get results from getHighlighting().



--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrJ-getHighlighting-does-not-return-results-in-order-tp4326218.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: model building

2017-03-22 Thread Joe Obernberger
Thank you Tim.  I appreciated the tips.  At this point, I'm just trying 
to understand how to use it.  The 30 tweets that I've selected so far, 
are, in fact threatening.  The things people say!  My favorite so far is 
'disingenuous twat waffle'.  No kidding.


The issue that I'm having is not with the model, it's with creating the 
model from a query other than *:*.


Example:

update(models2, batchSize="50",
 train(TRAINING,
  features(TRAINING,
 q="*:*",
 featureSet="threat1",
 field="ClusterText",
 outcome="out_i",
 positiveLabel=1,
 numTerms=100),
  q="*:*",
  name="threat1",
  field="ClusterText",
  outcome="out_i",
  maxIterations="100"))

Works great.  Makes a model - model works - can see reasonable results.  
However, say I've tagged a training set inside a larger collection 
called COL1 with a field called JoeID - like this:


update(models2, batchSize="50",
 train(COL1,
  features(COL1,
 q="JoeID:Training",
 featureSet="threat2",
 field="ClusterText",
 outcome="out_i",
 positiveLabel=1,
 numTerms=1000),
  q="JoeID:Training",
  name="threat2",
  field="ClusterText",
  outcome="out_i",
  maxIterations="100"))

This does not work as expected.  I can query the COL1 collection for 
JoeID:Training, and get a result set that I want to train on, but the 
model creation seems to not work.  At this point, if I want to make a 
model, I need to create a collection, put the training set into it, and 
then train on *:*.  This is fine, but I'm not sure if it's how it is 
supposed to work.


-Joe


On 3/21/2017 10:17 PM, Tim Casey wrote:

Joe,

To do this correctly, soundly, you will need to sample the data and mark
them as threatening or neutral.  You can probably expand on this quite a
bit, but that would be a good start.  You can then draw another set of
samples and see how you did.  You use one to train and one to validate.

What you are doing is probably just noise, from a model point of view, and
it will probably not make too much difference how you index/query/model
through the noise.

I don't mean this critically, just plainly.  Effectively the less
mathematically correctly you do this process, the more anecdotal the result.

tim


On Mon, Mar 20, 2017 at 4:42 PM, Joel Bernstein  wrote:


I've only tested with the training data in its own collection, but it was
designed for multiple training sets in the same collection.

I suspect your training set is too small to get a reliable model from.
The training sets we tested with were considerably larger.

All the idfs_ds values being the same seems odd though. The idfs_ds in
particular were designed to be accurate when there are multiple training
sets in the same collection.

Joel Bernstein
http://joelsolr.blogspot.com/

On Mon, Mar 20, 2017 at 5:41 PM, Joe Obernberger <
joseph.obernber...@gmail.com> wrote:


If I put the training data into its own collection and use q="*:*", then
it works correctly.  Is that a requirement?
Thank you.

-Joe



On 3/20/2017 3:47 PM, Joe Obernberger wrote:


I'm trying to build a model using tweets.  I've manually tagged 30
tweets as threatening, and 50 random tweets as non-threatening.  When
I build the model with:

update(models2, batchSize="50",
  train(UNCLASS,
   features(UNCLASS,
  q="ProfileID:PROFCLUST1",
  featureSet="threatFeatures3",
  field="ClusterText",
  outcome="out_i",
  positiveLabel=1,
  numTerms=250),
   q="ProfileID:PROFCLUST1",
   name="threatModel3",
   field="ClusterText",
   outcome="out_i",
   maxIterations="100"))

It appears to work, but all the idfs_ds values are identical. The
terms_ss values look reasonable, but nearly all the weights_ds are 1.0.
For out_i it is either -1 for non-threatening tweets, and +1 for
threatening tweets.  I'm trying to follow along with Joel Bernstein's
excellent post here:
http://joelsolr.blogspot.com/2017/01/deploying-ai-alerting-system-with-solrs.html

Tips?

Thank you!

-Joe






Re: model building

2017-03-22 Thread Joel Bernstein
I did a review of the code and it was definitely written to support having
multiple training sets in the same collection. So, it sounds like something
is not working as designed.

I planned on testing out model building with different types of training
sets anyway, so I'll comment on my findings in the ticket.

Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Mar 22, 2017 at 9:58 AM, Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> Thank you Tim.  I appreciated the tips.  At this point, I'm just trying to
> understand how to use it.  The 30 tweets that I've selected so far, are, in
> fact threatening.  The things people say!  My favorite so far is
> 'disingenuous twat waffle'.  No kidding.
>
> The issue that I'm having is not with the model, it's with creating the
> model from a query other than *:*.
>
> Example:
>
> update(models2, batchSize="50",
>  train(TRAINING,
>   features(TRAINING,
>  q="*:*",
>  featureSet="threat1",
>  field="ClusterText",
>  outcome="out_i",
>  positiveLabel=1,
>  numTerms=100),
>   q="*:*",
>   name="threat1",
>   field="ClusterText",
>   outcome="out_i",
>   maxIterations="100"))
>
> Works great.  Makes a model - model works - can see reasonable results.
> However, say I've tagged a training set inside a larger collection called
> COL1 with a field called JoeID - like this:
>
> update(models2, batchSize="50",
>  train(COL1,
>   features(COL1,
>  q="JoeID:Training",
>  featureSet="threat2",
>  field="ClusterText",
>  outcome="out_i",
>  positiveLabel=1,
>  numTerms=1000),
>   q="JoeID:Training",
>   name="threat2",
>   field="ClusterText",
>   outcome="out_i",
>   maxIterations="100"))
>
> This does not work as expected.  I can query the COL1 collection for
> JoeID:Training, and get a result set that I want to train on, but the model
> creation seems to not work.  At this point, if I want to make a model, I
> need to create a collection, put the training set into it, and then train
> on *:*.  This is fine, but I'm not sure if it's how it is supposed to work.
>
> -Joe
>
>
>
> On 3/21/2017 10:17 PM, Tim Casey wrote:
>
>> Joe,
>>
>> To do this correctly, soundly, you will need to sample the data and mark
>> them as threatening or neutral.  You can probably expand on this quite a
>> bit, but that would be a good start.  You can then draw another set of
>> samples and see how you did.  You use one to train and one to validate.
>>
>> What you are doing is probably just noise, from a model point of view, and
>> it will probably not make too much difference how you index/query/model
>> through the noise.
>>
>> I don't mean this critically, just plainly.  Effectively the less
>> mathematically correctly you do this process, the more anecdotal the
>> result.
>>
>> tim
>>
>>
>> On Mon, Mar 20, 2017 at 4:42 PM, Joel Bernstein 
>> wrote:
>>
>> I've only tested with the training data in it's own collection, but it was
>>> designed for multiple training sets in the same collection.
>>>
>>> I suspect you're training set is too small to get a reliable model from.
>>> The training sets we tested with were considerably larger.
>>>
>>> All the idfs_ds values being the same seems odd though. The idfs_ds in
>>> particular were designed to be accurate when there are multiple training
>>> sets in the same collection.
>>>
>>> Joel Bernstein
>>> http://joelsolr.blogspot.com/
>>>
>>> On Mon, Mar 20, 2017 at 5:41 PM, Joe Obernberger <
>>> joseph.obernber...@gmail.com> wrote:
>>>
>>> If I put the training data into its own collection and use q="*:*", then
 it works correctly.  Is that a requirement?
 Thank you.

 -Joe



 On 3/20/2017 3:47 PM, Joe Obernberger wrote:

 I'm trying to build a model using tweets.  I've manually tagged 30
 tweets as threatening, and 50 random tweets as non-threatening.  When
 I build the model with:
>
> update(models2, batchSize="50",
>   train(UNCLASS,
>features(UNCLASS,
>   q="ProfileID:PROFCLUST1",
>   featureSet="threatFeatures3",
>   field="ClusterText",
>   outcome="out_i",
> 

Both Nodes in shard think they are leader

2017-03-22 Thread philippa griggs
Hello,


I’m using Solr Cloud version 5.4.1.  I have two cores in a shard (a leader and
a replica). Every so often they both go into recovery/down and then come back up.
However, when they come back, they both think they are the leader.


I then have to manually step in, stop them both, start one and wait till it's
the leader before starting the second one.
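One way to check which replica ZooKeeper currently records as leader (host, port and collection name are placeholders) is the Collections API:

   http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=mycollection&wt=json

Each replica in the per-shard list of the response carries a "leader":"true" flag, which makes it easier to see whether the cluster state really shows two leaders or whether the nodes merely believe they are.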


Has anyone else seen this before, or does anyone have any suggestions as to why
this is happening?


Many thanks

Philippa


Re: Solr Delete By Id Out of memory issue

2017-03-22 Thread Chris Hostetter

: OK, The whole DBQ thing baffles the heck out of me so this may be
: totally off base. But would committing help here? Or at least be worth
: a test?

this isn't DBQ -- the OP specifically said deleteById, and that the
oldDeletes map (only used for DBI) was the problem according to the heap
dumps they looked at.

I suspect you are correct about the root cause of the OOMs ... perhaps the 
OP isn't using hard/soft commits effectively enough and the uncommitted 
data is what's causing the OOM ... hard to say w/o more details. or 
confirmation of exactly what the OP was looking at in their claim below 
about the heap dump


: > : Thanks for replying. We are using Solr 6.1 version. Even I saw that it is
: > : bounded by 1K count, but after looking at heap dump I was amazed how can 
it
: > : keep more than 1K entries. But Yes I see around 7M entries according to
: > : heap dump and around 17G of memory occupied by BytesRef there.
: >
: > what exactly are you looking at when you say you see "7M entries" ?
: >
: > are you sure you aren't confusing the keys in oldDeletes with other
: > instances of BytesRef in the JVM?


-Hoss
http://www.lucidworks.com/


RE: Exception while integrating openNLP with Solr

2017-03-22 Thread aruninfo100
Hi 

I am really finding it difficult to index documents using the OpenNLP
lemmatizer. The indexing is taking too much time (including the commit). Is
there a way to optimize or increase the performance?
It would also be helpful to know about other OpenNLP lemmatizer
implementations that perform well.

Thanks,
Arun



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Exception-while-integrating-openNLP-with-Solr-tp4326146p4326296.html
Sent from the Solr - User mailing list archive at Nabble.com.


RE: Exception while integrating openNLP with Solr

2017-03-22 Thread Markus Jelsma
Hi - We are not having large issues using OpenNLP for POS-tagging in Lucene.
But you mention commits; committing with or without POS payloads is hardly
any different, so commits should be unaffected. Maybe you have another issue?
Perhaps use a sampler to pinpoint the problem.

Markus

 
 
-Original message-
> From:aruninfo100 
> Sent: Wednesday 22nd March 2017 18:30
> To: solr-user@lucene.apache.org
> Subject: RE: Exception while integrating openNLP with Solr
> 
> Hi 
> 
> I am really finding it difficult to index documents using openNLP
> lemmatizer.The indexing is taking too much time(including commit).Is there a
> way to optimize or increase the performance.
> Also it will be helpful in knowing different opennlp lemmatizer
> implementations which are also  good performance based.
> 
> Thanks,
> Arun
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Exception-while-integrating-openNLP-with-Solr-tp4326146p4326296.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 


Custom FieldTypes

2017-03-22 Thread Ronald Wood
I have been mulling over the usefulness of a new Hash field type for being able 
to validate data that is indexed but not stored. Basically, I’d use copy 
directives to copy all fields to be hashed to the new hash field and store a 
SHA-256 hash as a string. I’m still not sure how valuable it would for us. 
Maybe someone has already done something similar?

However, I was wondering in general about how one would go about implementing 
and integrating a new FieldType.

Looking at UUIDField as an example, the work seems moderate. But then the
question is, how would I
integrate it? Just drop in a new jar with the class or does it have to be 
integrated into Solr as a proper commit?
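For what it's worth, a custom field type does not normally require patching Solr itself; a plugin jar on the classpath is usually enough. A rough sketch, with a hypothetical jar directory and class name:

   <!-- solrconfig.xml -->
   <lib dir="${solr.install.dir:../../../..}/custom-plugins" regex=".*\.jar"/>

   <!-- managed-schema / schema.xml -->
   <fieldType name="hash" class="com.example.solr.HashField"/>

Contributing the code upstream is only needed if you want it shipped with Solr.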

If it were valuable for others, I would love to contribute it, should we go 
ahead with it. But I already have had trouble getting our Legal Dept. to give 
the go ahead to contribute the code that worked for re-indexing docValues in 
place (SOLR-9437). ☹

-Ronald S. Wood



Regex Phrases

2017-03-22 Thread Mark Johnson
Is it possible to configure Solr to treat text that matches a regex as a
phrase?

I have a database full of products, and the Title and Description fields
are text_en, tokenized via the StandardTokenizerFactory. This works in most
cases, but a number of products have names like:

 - Vitamin A
 - Vitamin-A
 - Vitamin B12
 - Vitamin B-12
...and so on

I have a regex that will match all of the permutations and would like to
configure the field type so that anything that matches the regex pattern is
treated as a single token, instead of being broken up by spaces, etc. Is
that possible?

-- 
*This message is intended only for the use of the individual or entity to 
which it is addressed and may contain information that is privileged, 
confidential and exempt from disclosure under applicable law. If you have 
received this message in error, you are hereby notified that any use, 
dissemination, distribution or copying of this message is prohibited. If 
you have received this communication in error, please notify the sender 
immediately and destroy the transmitted information.*


RE: Exception while integrating openNLP with Solr

2017-03-22 Thread aruninfo100
Hi,
Thanks for the reply.

Kindly find the field type schema I am using:

 


Does the *opennlp_text* field need to be indexed="true"?

<fieldType name="..." class="..." positionIncrementGap="100">
  <analyzer>
    <tokenizer class="..." sentenceModel="opennlp/en-sent.bin" tokenizerModel="opennlp/en-token.bin"/>
    <filter class="..." posTaggerModel="opennlp/en-pos-maxent.bin"/>
    <filter class="..." dictionary="opennlp/en-lemmatizer.txt"/>
  </analyzer>
</fieldType>

Here the en-lemmatizer.txt is 7MB in size. Without lemmatization the whole
indexing process usually takes 2-3 minutes on average, but here it is taking
more than an hour and still going. Is this related to the lemmatizer file?
Could you please guide me?

Thanks,
Arun



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Exception-while-integrating-openNLP-with-Solr-tp4326146p4326311.html
Sent from the Solr - User mailing list archive at Nabble.com.


Tuple object implementing Serializable

2017-03-22 Thread Kiran Chitturi
Hi,

Is there any reason that the Tuple object does not implement Serializable,
like SolrDocumentBase, which does implement Serializable?

In the spark-solr library, I want to
return an RDD of Tuple objects but it fails because the Tuple class does
not implement Serializable

2017-03-22 01:45:51,230 [Executor task launch worker-0] ERROR Executor  -
> Exception in task 0.0 in stage 0.0 (TID 0)
> java.io.NotSerializableException: org.apache.solr.client.solrj.io.Tuple
> Serialization stack:
> - object not serializable (class:
> org.apache.solr.client.solrj.io.Tuple, value:
> org.apache.solr.client.solrj.io.Tuple@365e4da1)
> - element of array (index: 0)
> - array (class [Lorg.apache.solr.client.solrj.io.Tuple;, size 10)
> at
> org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
> at
> org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
> at
> org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
> at
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:324)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)


To get past this error, we need to implement Serializable for Tuple object.
Is there a reason not to do that?

We are working past this error by doing conversions from Tuple object to
other objects but it would be ideal (in terms of performance) if we can
just deal with Tuple objects directly in Spark world.
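As a sketch of that kind of conversion (assuming Solr 6.x, where Tuple exposes its backing map as the public "fields" member), a shallow copy into a HashMap gives Spark something Serializable to work with:

   import java.util.HashMap;
   import java.util.Map;
   import org.apache.solr.client.solrj.io.Tuple;

   final class Tuples {
     // HashMap is Serializable (as long as the contained values are), so a
     // shallow copy of the Tuple's field map can cross Spark's serialization
     // boundary where the Tuple itself cannot.
     @SuppressWarnings("unchecked")
     static Map<Object, Object> toSerializableMap(Tuple tuple) {
       return new HashMap<Object, Object>(tuple.fields);
     }
   }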

Thanks,
-- 
Kiran Chitturi


Re: Custom FieldTypes

2017-03-22 Thread Alexandre Rafalovitch
Can this be done at the UpdateRequestProcessor stage?

Regards,
Alex


On 22 Mar 2017 1:48 PM, "Ronald Wood"  wrote:

I have been mulling over the usefulness of a new Hash field type for being
able to validate data that is indexed but not stored. Basically, I’d use
copy directives to copy all fields to be hashed to the new hash field and
store a SHA-256 hash as a string. I’m still not sure how valuable it would
for us. Maybe someone has already done something similar?

However, I was wondering in general about how one would go about
implementing and integrating a few FieldType.

Looking at UUIDField as an
example, the work seems moderate. But then the question is, how would I
integrate it? Just drop in a new jar with the class or does it have to be
integrated into Solr as a proper commit?

If it were valuable for others, I would love to contribute it, should we go
ahead with it. But I already have had trouble getting our Legal Dept. to
give the go ahead to contribute the code that worked for re-indexing
docValues in place (SOLR-9437). ☹

-Ronald S. Wood


Re: Regex Phrases

2017-03-22 Thread Erick Erickson
Take a close look at WordDelimiterFilterFactory, it's designed to deal
with things like part numbers, phone numbers and the like, and the
example you gave is in the same class of problem I think. It'll take
a bit to get your head around what it does, but it'll perform better
than regexes, assuming you can get what you need out of it.

And the admin/analysis page will help you _greatly_ in understanding
what the effects of the various parameters are.
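As a starting point only (the exact flags depend on how you want to query), a filter along these lines emits both the split parts and the catenated forms, so the hyphenated and non-hyphenated spellings produce overlapping tokens (e.g. "B-12" and "B12" both yield B, 12 and B12):

   <filter class="solr.WordDelimiterFilterFactory"
           generateWordParts="1" generateNumberParts="1"
           catenateWords="1" catenateNumbers="1" catenateAll="1"
           splitOnCaseChange="0" preserveOriginal="1"/>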

Best,
Erick

On Wed, Mar 22, 2017 at 11:06 AM, Mark Johnson
 wrote:
> Is it possible to configure Solr to treat text that matches a regex as a
> phrase?
>
> I have a database full of products, and the Title and Description fields
> are text_en, tokenized via the StandardTokenizerFactory. This works in most
> cases, but a number of products have names like:
>
>  - Vitamin A
>  - Vitamin-A
>  - Vitamin B12
>  - Vitamin B-12
> ...and so on
>
> I have a regex that will match all of the permutations and would like to
> configure the field type so that anything that matches the regex pattern is
> treated as a single token, instead of being broken up by spaces, etc. Is
> that possible?
>
> --
> *This message is intended only for the use of the individual or entity to
> which it is addressed and may contain information that is privileged,
> confidential and exempt from disclosure under applicable law. If you have
> received this message in error, you are hereby notified that any use,
> dissemination, distribution or copying of this message is prohibited. If
> you have received this communication in error, please notify the sender
> immediately and destroy the transmitted information.*


Re: Regex Phrases

2017-03-22 Thread Mark Johnson
Awesome, thank you much!

On Wed, Mar 22, 2017 at 2:38 PM, Erick Erickson 
wrote:

> Take a close look at WordDelimiterFilterFactory, it's designed to deal
> with things like part numbers, phone numbers and the like, and the
> example you gave is in the same class of problem I think. It'll take
> a bit to get your head around what it does, but it'll perfom better
> than regexes, assuming you can get what you need out of it.
>
> And the admin/analysis page will help you _greatly_ in understanding
> what the effects of the various parameters are.
>
> Best,
> Erick
>
> On Wed, Mar 22, 2017 at 11:06 AM, Mark Johnson
>  wrote:
> > Is it possible to configure Solr to treat text that matches a regex as a
> > phrase?
> >
> > I have a database full of products, and the Title and Description fields
> > are text_en, tokenized via the StandardTokenizerFactory. This works in
> most
> > cases, but a number of products have names like:
> >
> >  - Vitamin A
> >  - Vitamin-A
> >  - Vitamin B12
> >  - Vitamin B-12
> > ...and so on
> >
> > I have a regex that will match all of the permutations and would like to
> > configure the field type so that anything that matches the regex pattern
> is
> > treated as a single token, instead of being broken up by spaces, etc. Is
> > that possible?
> >
> > --
> > *This message is intended only for the use of the individual or entity to
> > which it is addressed and may contain information that is privileged,
> > confidential and exempt from disclosure under applicable law. If you have
> > received this message in error, you are hereby notified that any use,
> > dissemination, distribution or copying of this message is prohibited. If
> > you have received this communication in error, please notify the sender
> > immediately and destroy the transmitted information.*
>



-- 

Best Regards,

*Mark Johnson* | .NET Software Engineer

Office: 603-392-7017

Emerson Ecologics, LLC | 1230 Elm Street | Suite 301 | Manchester NH | 03101

  

*Supporting The Practice Of Healthy Living*









-- 
*This message is intended only for the use of the individual or entity to 
which it is addressed and may contain information that is privileged, 
confidential and exempt from disclosure under applicable law. If you have 
received this message in error, you are hereby notified that any use, 
dissemination, distribution or copying of this message is prohibited. If 
you have received this communication in error, please notify the sender 
immediately and destroy the transmitted information.*


Re: Custom FieldTypes

2017-03-22 Thread Ronald Wood
I suppose it could be, but the flexibility of using copy directives is 
appealing for handling multiple fields as defined in the schema.

Since I have rarely looked at the UpdateRequestProcessor, I guess I don’t know 
if it could take multiple fields to hash, and if so how that would be expressed.

-R

On 3/22/17, 2:21 PM, "Alexandre Rafalovitch"  wrote:

Can this be done at the UpdateRequestProcessor stage?

Regards,
Alex


On 22 Mar 2017 1:48 PM, "Ronald Wood"  wrote:

I have been mulling over the usefulness of a new Hash field type for being
able to validate data that is indexed but not stored. Basically, I’d use
copy directives to copy all fields to be hashed to the new hash field and
store a SHA-256 hash as a string. I’m still not sure how valuable it would
for us. Maybe someone has already done something similar?

However, I was wondering in general about how one would go about
implementing and integrating a few FieldType.

Looking at UUIDField as an
example, the work seems moderate. But then the question is, how would I
integrate it? Just drop in a new jar with the class or does it have to be
integrated into Solr as a proper commit?

If it were valuable for others, I would love to contribute it, should we go
ahead with it. But I already have had trouble getting our Legal Dept. to
give the go ahead to contribute the code that worked for re-indexing
docValues in place (SOLR-9437). ☹

-Ronald S. Wood




Re: Custom FieldTypes

2017-03-22 Thread Alexandre Rafalovitch
You'd use CloneField URP
http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/update/processor/CloneFieldUpdateProcessorFactory.html

Then you do your custom algorithm. Or - as I just remembered - use one
of the hash ones described in dedupe section:
https://cwiki.apache.org/confluence/display/solr/De-Duplication (which
don't seem to require CloneField anyway).
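A sketch of that configuration, adapted from the De-Duplication page (chain name, signature field and field list are placeholders; overwriteDupes is false so nothing is replaced and the hash is simply stored):

   <updateRequestProcessorChain name="add-signature">
     <processor class="solr.processor.SignatureUpdateProcessorFactory">
       <bool name="enabled">true</bool>
       <str name="signatureField">content_hash</str>
       <bool name="overwriteDupes">false</bool>
       <str name="fields">text,title</str>
       <str name="signatureClass">solr.processor.Lookup3Signature</str>
     </processor>
     <processor class="solr.LogUpdateProcessorFactory"/>
     <processor class="solr.RunUpdateProcessorFactory"/>
   </updateRequestProcessorChain>

A SHA-256 signature would mean supplying a custom Signature implementation, since the shipped ones (Lookup3Signature, MD5Signature, TextProfileSignature) don't produce SHA-256.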

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 22 March 2017 at 14:55, Ronald Wood  wrote:
> I suppose it could be, but the flexibility of using copy directives is 
> appealing for handling multiple fields as defined in the schema.
>
> Since I have rarely looked at the UpdateRequestProcessor, I guess I don’t 
> know if it could take multiple fields to hash, and if so how that would be 
> expressed.
>
> -R
>
> On 3/22/17, 2:21 PM, "Alexandre Rafalovitch"  wrote:
>
> Can this be done at the UpdateRequestProcessor stage?
>
> Regards,
> Alex
>
>
> On 22 Mar 2017 1:48 PM, "Ronald Wood"  wrote:
>
> I have been mulling over the usefulness of a new Hash field type for being
> able to validate data that is indexed but not stored. Basically, I’d use
> copy directives to copy all fields to be hashed to the new hash field and
> store a SHA-256 hash as a string. I’m still not sure how valuable it would
> for us. Maybe someone has already done something similar?
>
> However, I was wondering in general about how one would go about
> implementing and integrating a few FieldType.
>
> Looking at UUIDField master/solr/core/src/java/org/apache/solr/schema/UUIDField.java> as an
> example, the work seems moderate. But then the question is, how would I
> integrate it? Just drop in a new jar with the class or does it have to be
> integrated into Solr as a proper commit?
>
> If it were valuable for others, I would love to contribute it, should we 
> go
> ahead with it. But I already have had trouble getting our Legal Dept. to
> give the go ahead to contribute the code that worked for re-indexing
> docValues in place (SOLR-9437). ☹
>
> -Ronald S. Wood
>
>


Re: Custom FieldTypes

2017-03-22 Thread Ronald Wood
Thanks. I had seen that page but had passed it over since I don’t want to do 
de-duping (text fields with the exact same text are possible and not cause for 
de-dupe).

If I want just to store the signature, it looks like I define the 
signatureField in the configuration and set overwriteDupes to true (since I 
don’t actually regard them as dupes).

I guess the one downside to this is that the processor will run regardless of 
the document type (we have 6 types and only 3 need hashes on text). Or maybe 
empty values for fields stops the processor? No signature is needed when the 
text fields are not provided.

-R

On 3/22/17, 3:20 PM, "Alexandre Rafalovitch"  wrote:

You'd use CloneField URP

http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/update/processor/CloneFieldUpdateProcessorFactory.html

Then you do your custom algorithm. Or - as I just remembered - use one
of the hash ones described in dedupe section:
https://cwiki.apache.org/confluence/display/solr/De-Duplication (which
don't see to require CloneField anyway).

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 22 March 2017 at 14:55, Ronald Wood  wrote:
> I suppose it could be, but the flexibility of using copy directives is 
appealing for handling multiple fields as defined in the schema.
>
> Since I have rarely looked at the UpdateRequestProcessor, I guess I don’t 
know if it could take multiple fields to hash, and if so how that would be 
expressed.
>
> -R
>
> On 3/22/17, 2:21 PM, "Alexandre Rafalovitch"  wrote:
>
> Can this be done at the UpdateRequestProcessor stage?
>
> Regards,
> Alex
>
>
> On 22 Mar 2017 1:48 PM, "Ronald Wood"  wrote:
>
> I have been mulling over the usefulness of a new Hash field type for 
being
> able to validate data that is indexed but not stored. Basically, I’d 
use
> copy directives to copy all fields to be hashed to the new hash field 
and
> store a SHA-256 hash as a string. I’m still not sure how valuable it 
would
> for us. Maybe someone has already done something similar?
>
> However, I was wondering in general about how one would go about
> implementing and integrating a few FieldType.
>
> Looking at UUIDField master/solr/core/src/java/org/apache/solr/schema/UUIDField.java> as an
> example, the work seems moderate. But then the question is, how would 
I
> integrate it? Just drop in a new jar with the class or does it have 
to be
> integrated into Solr as a proper commit?
>
> If it were valuable for others, I would love to contribute it, should 
we go
> ahead with it. But I already have had trouble getting our Legal Dept. 
to
> give the go ahead to contribute the code that worked for re-indexing
> docValues in place (SOLR-9437). ☹
>
> -Ronald S. Wood
>
>




Concatenating streams in streaming expressions

2017-03-22 Thread Matt Magnusson
Hello;

Does anyone know of a way where I can concatenate source streams?

For example if I have two searches
search(prod,q="content:cat",fl="id,score",sort="score desc")
search(prod,q="content:dog",fl="id,score",sort="score desc")


Is there a way to have these come out as one stream. I've been trying
to use the executor function by storing these searches as expr_s.  I
however, can't figure out how to merge the output of these back into
one stream.  If I run the following code,

executor(search(queries, q="*:*",fl="id, expr_s", sort="id asc",
qt="/export")). It gives this output:

{
  "result-set": {
"docs": [
  {
"EOF": true,
"RESPONSE_TIME": 32
  }
]
  }
}

So not the underlying tuples returned.

I want it the return to be like this for all individual searches
combined into one stream.

{
  "result-set": {
"docs": [
  {
"score": 12.340755,
"id": "9a49d7d6f5b3cc597f8e55e66bb6d96438b670d1"
  },
  {
"score": 11.879734,
"id": "887d349fc9390a87ac7fd4209af59af61531ad06"
  },
  {
"score": 11.82577,
"id": "c91971049ab95cb32dc2d0f8d616aad25ee04bb7"
  },...




 I know the searches are working correctly using the executor function
because I can have them save output back to solr if I also include the
update and commit functions in the expr_s field in my source queries
collection.  Thanks

Matt


Re: Regex Phrases

2017-03-22 Thread Susheel Kumar
I have used PatternReplaceFilterFactory in some of these situations. e.g.
below

  <filter class="solr.PatternReplaceFilterFactory" pattern="(\d+)-(\d+)-?(\d+)$"
          replacement="$1$2$3"/>

On Wed, Mar 22, 2017 at 2:54 PM, Mark Johnson  wrote:

> Awesome, thank you much!
>
> On Wed, Mar 22, 2017 at 2:38 PM, Erick Erickson 
> wrote:
>
> > Take a close look at WordDelimiterFilterFactory, it's designed to deal
> > with things like part numbers, phone numbers and the like, and the
> > example you gave is in the same class of problem I think. It'll take
> > a bit to get your head around what it does, but it'll perfom better
> > than regexes, assuming you can get what you need out of it.
> >
> > And the admin/analysis page will help you _greatly_ in understanding
> > what the effects of the various parameters are.
> >
> > Best,
> > Erick
> >
> > On Wed, Mar 22, 2017 at 11:06 AM, Mark Johnson
> >  wrote:
> > > Is it possible to configure Solr to treat text that matches a regex as
> a
> > > phrase?
> > >
> > > I have a database full of products, and the Title and Description
> fields
> > > are text_en, tokenized via the StandardTokenizerFactory. This works in
> > most
> > > cases, but a number of products have names like:
> > >
> > >  - Vitamin A
> > >  - Vitamin-A
> > >  - Vitamin B12
> > >  - Vitamin B-12
> > > ...and so on
> > >
> > > I have a regex that will match all of the permutations and would like
> to
> > > configure the field type so that anything that matches the regex
> pattern
> > is
> > > treated as a single token, instead of being broken up by spaces, etc.
> Is
> > > that possible?
> > >
> > > --
> > > *This message is intended only for the use of the individual or entity
> to
> > > which it is addressed and may contain information that is privileged,
> > > confidential and exempt from disclosure under applicable law. If you
> have
> > > received this message in error, you are hereby notified that any use,
> > > dissemination, distribution or copying of this message is prohibited.
> If
> > > you have received this communication in error, please notify the sender
> > > immediately and destroy the transmitted information.*
> >
>
>
>
> --
>
> Best Regards,
>
> *Mark Johnson* | .NET Software Engineer
>
> Office: 603-392-7017
>
> Emerson Ecologics, LLC | 1230 Elm Street | Suite 301 | Manchester NH |
> 03101
>
>   
>
> *Supporting The Practice Of Healthy Living*
>
> 
> 
> 
> 
> 
> 
>
> --
> *This message is intended only for the use of the individual or entity to
> which it is addressed and may contain information that is privileged,
> confidential and exempt from disclosure under applicable law. If you have
> received this message in error, you are hereby notified that any use,
> dissemination, distribution or copying of this message is prohibited. If
> you have received this communication in error, please notify the sender
> immediately and destroy the transmitted information.*
>


Re: Solr Delete By Id Out of memory issue

2017-03-22 Thread Rohit Kanchan
For commits we are relying on auto commits. We have defined the following in
our configs:

   

1

3

false





15000



One thing which I would like to mention is that we are not calling deleteById
directly from the client. We have created an update chain and added a processor
there. In this processor we query first, collect the BytesRefs from the
resulting BytesRefHash, and set each one as the indexedId. After collecting the
indexedIds we use those ids to call delete by id. We are doing this because we
do not want to query Solr before deleting on the client side. It is possible
that there is a bug in this code, but I am not sure, because when I run tests
locally it does not show any issues. I am trying to remote debug now.
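For comparison, the plain client-side version of the same operation would be something like this (collection name is a placeholder), which also makes the commit point explicit:

   import java.util.List;
   import org.apache.solr.client.solrj.SolrClient;

   final class DeleteHelper {
     // Hedged baseline: batched deleteById plus an explicit commit, instead of
     // resolving the ids inside a custom update chain.
     static void deleteBatch(SolrClient client, List<String> ids) throws Exception {
       client.deleteById("mycollection", ids);
       client.commit("mycollection");
     }
   }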

Thanks
Rohit


On Wed, Mar 22, 2017 at 9:57 AM, Chris Hostetter 
wrote:

>
> : OK, The whole DBQ thing baffles the heck out of me so this may be
> : totally off base. But would committing help here? Or at least be worth
> : a test?
>
> ths isn't DBQ -- the OP specifically said deleteById, and that the
> oldDeletes map (only used for DBI) was the problem acording to the heap
> dumps they looked at.
>
> I suspect you are correct about the root cause of the OOMs ... perhaps the
> OP isn't using hard/soft commits effectively enough and the uncommitted
> data is what's causing the OOM ... hard to say w/o more details. or
> confirmation of exactly what the OP was looking at in their claim below
> about the heap dump
>
>
> : > : Thanks for replying. We are using Solr 6.1 version. Even I saw that
> it is
> : > : bounded by 1K count, but after looking at heap dump I was amazed how
> can it
> : > : keep more than 1K entries. But Yes I see around 7M entries according
> to
> : > : heap dump and around 17G of memory occupied by BytesRef there.
> : >
> : > what exactly are you looking at when you say you see "7M entries" ?
> : >
> : > are you sure you aren't confusing the keys in oldDeletes with other
> : > instances of BytesRef in the JVM?
>
>
> -Hoss
> http://www.lucidworks.com/
>


Re: Concatenating streams in streaming expressions

2017-03-22 Thread Joel Bernstein
There isn't a cat function yet. The closest function we have currently is a
merge function:

https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions#StreamingExpressions-merge

But I've been meaning to add a cat function so feel free to create the jira.
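For the two example searches in the original message, a merge-based version could look like the sketch below; merge requires both underlying streams to be sorted on the field(s) named in "on", so the sort has to move from score to a shared key such as id:

   merge(
     search(prod, q="content:cat", fl="id,score", sort="id asc"),
     search(prod, q="content:dog", fl="id,score", sort="id asc"),
     on="id asc")

A true cat() would simply append one stream after the other without that restriction.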


Joel Bernstein
http://joelsolr.blogspot.com/

On Wed, Mar 22, 2017 at 4:12 PM, Matt Magnusson 
wrote:

> Hello;
>
> Does anyone know of a way where I can concatenate source streams?
>
> For example if I have two searches
> search(prod,q="content:cat",fl="id,score",sort="score desc")
> search(prod,q="content:dog",fl="id,score",sort="score desc")
>
>
> Is there a way to have these come out as one stream. I've been trying
> to use the executor function by storing these searches as expr_s.  I
> however, can't figure out how to merge the output of these back into
> one stream.  If I run the following code,
>
> executor(search(queries, q="*:*",fl="id, expr_s", sort="id asc",
> qt="/export")). It gives this output:
>
> {
>   "result-set": {
> "docs": [
>   {
> "EOF": true,
> "RESPONSE_TIME": 32
>   }
> ]
>   }
> }
>
> So not the underlying tuples returned.
>
> I want it the return to be like this for all individual searches
> combined into one stream.
>
> {
>   "result-set": {
> "docs": [
>   {
> "score": 12.340755,
> "id": "9a49d7d6f5b3cc597f8e55e66bb6d96438b670d1"
>   },
>   {
> "score": 11.879734,
> "id": "887d349fc9390a87ac7fd4209af59af61531ad06"
>   },
>   {
> "score": 11.82577,
> "id": "c91971049ab95cb32dc2d0f8d616aad25ee04bb7"
>   },...
>
>
>
>
>  I know the searches are working correctly using the executor function
> because I can have them save output back to solr if I also include the
> update and commit functions in the expr_s field in my source queries
> collection.  Thanks
>
> Matt
>


Re: Custom FieldTypes

2017-03-22 Thread Alexandre Rafalovitch
You could provide the URP chain name (or individual URPs) when you
index a particular document type, but that requires you to send all the
document types you want to put a signature on together.

Or you could have a custom URP that skips other ones (they are
chained), though that's messier.

And I think you want overwriteDupes as "false" actually, otherwise URP
will delete the previous matching document.

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 22 March 2017 at 15:46, Ronald Wood  wrote:
> Thanks. I had seen that page but had passed it over since I don’t want to do 
> de-duping (text fields with the exact same text are possible and not cause 
> for de-dupe).
>
> If I want just to store the signature, it looks like I define the 
> signatureField in the configuration and set overwriteDupes to true (since I 
> don’t actually regard them as dupes).
>
> I guess the one downside to this is that the processor will run regardless of 
> the document type (we have 6 types and only 3 need hashes on text). Or maybe 
> empty values for fields stops the processor? No signature is needed when the 
> text fields are not provided.
>
> -R
>
> On 3/22/17, 3:20 PM, "Alexandre Rafalovitch"  wrote:
>
> You'd use CloneField URP
> 
> http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/update/processor/CloneFieldUpdateProcessorFactory.html
>
> Then you do your custom algorithm. Or - as I just remembered - use one
> of the hash ones described in dedupe section:
> https://cwiki.apache.org/confluence/display/solr/De-Duplication (which
> don't see to require CloneField anyway).
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and experienced
>
>
> On 22 March 2017 at 14:55, Ronald Wood  wrote:
> > I suppose it could be, but the flexibility of using copy directives is 
> appealing for handling multiple fields as defined in the schema.
> >
> > Since I have rarely looked at the UpdateRequestProcessor, I guess I 
> don’t know if it could take multiple fields to hash, and if so how that would 
> be expressed.
> >
> > -R
> >
> > On 3/22/17, 2:21 PM, "Alexandre Rafalovitch"  wrote:
> >
> > Can this be done at the UpdateRequestProcessor stage?
> >
> > Regards,
> > Alex
> >
> >
> > On 22 Mar 2017 1:48 PM, "Ronald Wood"  wrote:
> >
> > I have been mulling over the usefulness of a new Hash field type 
> for being
> > able to validate data that is indexed but not stored. Basically, 
> I’d use
> > copy directives to copy all fields to be hashed to the new hash 
> field and
> > store a SHA-256 hash as a string. I’m still not sure how valuable 
> it would
> > for us. Maybe someone has already done something similar?
> >
> > However, I was wondering in general about how one would go about
> > implementing and integrating a few FieldType.
> >
> > Looking at UUIDField > master/solr/core/src/java/org/apache/solr/schema/UUIDField.java> as 
> an
> > example, the work seems moderate. But then the question is, how 
> would I
> > integrate it? Just drop in a new jar with the class or does it have 
> to be
> > integrated into Solr as a proper commit?
> >
> > If it were valuable for others, I would love to contribute it, 
> should we go
> > ahead with it. But I already have had trouble getting our Legal 
> Dept. to
> > give the go ahead to contribute the code that worked for re-indexing
> > docValues in place (SOLR-9437). ☹
> >
> > -Ronald S. Wood
> >
> >
>
>


Re: Custom FieldTypes

2017-03-22 Thread Ronald Wood
Thanks, Alex. I’ll experiment with it.

-R

On 3/22/17, 4:38 PM, "Alexandre Rafalovitch"  wrote:

You could provide the URP chain name (or individual URPs) when you
index a particular document type, but that requires you to send all
document types to put signature on together.

Or you could have a custom URP that skips other ones (they are
chained), though that's messier.

And I think you want overwriteDupes as "false" actually, otherwise URP
will delete the previous matching document.

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 22 March 2017 at 15:46, Ronald Wood  wrote:
> Thanks. I had seen that page but had passed it over since I don’t want to 
do de-duping (text fields with the exact same text are possible and not cause 
for de-dupe).
>
> If I want just to store the signature, it looks like I define the 
signatureField in the configuration and set overwriteDupes to true (since I 
don’t actually regard them as dupes).
>
> I guess the one downside to this is that the processor will run 
regardless of the document type (we have 6 types and only 3 need hashes on 
text). Or maybe empty values for fields stops the processor? No signature is 
needed when the text fields are not provided.
>
> -R
>
> On 3/22/17, 3:20 PM, "Alexandre Rafalovitch"  wrote:
>
> You'd use CloneField URP
> 
http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/update/processor/CloneFieldUpdateProcessorFactory.html
>
> Then you do your custom algorithm. Or - as I just remembered - use one
> of the hash ones described in dedupe section:
> https://cwiki.apache.org/confluence/display/solr/De-Duplication (which
> don't see to require CloneField anyway).
>
> Regards,
>Alex.
> 
> http://www.solr-start.com/ - Resources for Solr users, new and 
experienced
>
>
> On 22 March 2017 at 14:55, Ronald Wood  wrote:
> > I suppose it could be, but the flexibility of using copy directives 
is appealing for handling multiple fields as defined in the schema.
> >
> > Since I have rarely looked at the UpdateRequestProcessor, I guess I 
don’t know if it could take multiple fields to hash, and if so how that would 
be expressed.
> >
> > -R
> >
> > On 3/22/17, 2:21 PM, "Alexandre Rafalovitch"  
wrote:
> >
> > Can this be done at the UpdateRequestProcessor stage?
> >
> > Regards,
> > Alex
> >
> >
> > On 22 Mar 2017 1:48 PM, "Ronald Wood"  wrote:
> >
> > I have been mulling over the usefulness of a new Hash field 
type for being
> > able to validate data that is indexed but not stored. 
Basically, I’d use
> > copy directives to copy all fields to be hashed to the new hash 
field and
> > store a SHA-256 hash as a string. I’m still not sure how 
valuable it would
> > for us. Maybe someone has already done something similar?
> >
> > However, I was wondering in general about how one would go about
> > implementing and integrating a few FieldType.
> >
> > Looking at UUIDField > 
master/solr/core/src/java/org/apache/solr/schema/UUIDField.java> as an
> > example, the work seems moderate. But then the question is, how 
would I
> > integrate it? Just drop in a new jar with the class or does it 
have to be
> > integrated into Solr as a proper commit?
> >
> > If it were valuable for others, I would love to contribute it, 
should we go
> > ahead with it. But I already have had trouble getting our Legal 
Dept. to
> > give the go ahead to contribute the code that worked for 
re-indexing
> > docValues in place (SOLR-9437). ☹
> >
> > -Ronald S. Wood
> >
> >
>
>




Re: SolrJ getHighlighting() does not return results in order

2017-03-22 Thread Bryan Bende
Hello,

I believe getHighlighting() returns Map<String, Map<String, List<String>>>.

Generally Maps are not expected to iterate in order unless you know
the underlying implementation of the Map, for example LinkedHashMap
will iterate in the insertion order and HashMap will not.

You should be able to take the doc id from one of the results in the
document list and then do getHighlighting().get(docid) to get the
Map<String, List<String>> for the given document.
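A small sketch of that lookup, in the same fragment style as the original message (field names "id" and "content" are taken from it and may differ in your schema):

   // ...
   QueryResponse rsp = solrClient.query(query);
   Map<String, Map<String, List<String>>> hl = rsp.getHighlighting();
   for (SolrDocument doc : rsp.getResults()) {
       // look the snippets up by document id, so they follow the ranked
       // order of the result list regardless of the map's iteration order
       String id = String.valueOf(doc.getFieldValue("id"));
       List<String> snippets =
           hl.getOrDefault(id, Collections.emptyMap()).get("content");
   }
   // ...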

Hope that helps.

-Bryan


On Wed, Mar 22, 2017 at 8:54 AM, leoperezpulido
 wrote:
> Hi,
>
> Implementing highlighting with *SolrJ* does not return results in the proper
> order while I "page" through results. This not seems to be a problem with
> the RESTful API.
>
> // ...
> query.setQuery("text");
> /*
> The problem is when I set start to get different "pages",
> the results returned by getHighlighting() are disordered.
> */
> query.setStart(0);
> query.setSort("score", SolrQuery.ORDER.desc);
> query.setIncludeScore(true);
>
> query.setHighlight(true);
> query.addHightlightField("content");
> // ...
>
> Take the example of a simple index with a field named content and field's
> values like:
> Document 1
> Document 2
> Document 3
> etc.
>
> With the results returned by SolrDocumentList and with the RESTful API, I
> can paginate in the normal way, and the results remain ordered. This is not
> the case when I get results from getHighlighting().
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/SolrJ-getHighlighting-does-not-return-results-in-order-tp4326218.html
> Sent from the Solr - User mailing list archive at Nabble.com.


RE: Exception while integrating openNLP with Solr

2017-03-22 Thread Markus Jelsma
Hi - We don't use that OpenNLP patch, nor do we use that kind of lemmatizer. We
just rely on POS-tagging via a CharFilter with custom-trained maxent models and
it is fast enough.

So, do you really need that analyzer that is giving you a hard time? I don't 
know what that lemmatizer does but you can get a really fine search engine with 
POS-tagging alone, and that is fast enough.

My question now is, why do you need that patch? What do you intend to do with 
it? Maybe you can get what you need with simpler things than that patch.

Regards,
Markus
 
-Original message-
> From:aruninfo100 
> Sent: Wednesday 22nd March 2017 19:15
> To: solr-user@lucene.apache.org
> Subject: RE: Exception while integrating openNLP with Solr
> 
> Hi,
> Thanks for the reply.
> 
> Kindly find  the filed type scghema i am using :
> 
>  
> 
> 
> Does the *opennlp_text* field be indexed="true"?
> 
>   positionIncrementGap="100">
>   
>  sentenceModel="opennlp/en-sent.bin"  tokenizerModel="opennlp/en-token.bin"/>
>  posTaggerModel="opennlp/en-pos-maxent.bin"/>
> dictionary="opennlp/en-lemmatizer.txt"/>
>   
> 
> 
> Here the en-lemmatizer.txt is 7MB in size. Without lemmatization the whole
> indexing process usually takes 2-3 minutes on average, but here it is
> taking more than 1 hour and is still running. Is this related to the
> lemmatizer file?
> Could you please guide me?
> 
> Thanks,
> Arun
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Exception-while-integrating-openNLP-with-Solr-tp4326146p4326311.html
> Sent from the Solr - User mailing list archive at Nabble.com.
> 


Re: Regex Phrases

2017-03-22 Thread Erick Erickson
Susheel:

That'll work, but the options you've specified for
WordDelimiterFilterFactory pretty much make it so it's doing nothing.
I realize it's commented out...

That said, it's true that if you have a very specific pattern you want
to recognize, a regex can do the trick. WDFF is a bit more generic,
though, and is useful when your requirements are less specific.

Best,
Erick

On Wed, Mar 22, 2017 at 12:56 PM, Susheel Kumar  wrote:
> I have used PatternReplaceFilterFactory in some of these situations. e.g.
> below
>
>   <filter class="solr.PatternReplaceFilterFactory" pattern="(\d+)-(\d+)-?(\d+)$"
> replacement="$1$2$3"/>
>
> On Wed, Mar 22, 2017 at 2:54 PM, Mark Johnson > wrote:
>
>> Awesome, thank you much!
>>
>> On Wed, Mar 22, 2017 at 2:38 PM, Erick Erickson 
>> wrote:
>>
>> > Take a close look at WordDelimiterFilterFactory, it's designed to deal
>> > with things like part numbers, phone numbers and the like, and the
>> > example you gave is in the same class of problem I think. It'll take
>> > a bit to get your head around what it does, but it'll perform better
>> > than regexes, assuming you can get what you need out of it.
>> >
>> > And the admin/analysis page will help you _greatly_ in understanding
>> > what the effects of the various parameters are.
>> >
>> > Best,
>> > Erick
>> >
>> > On Wed, Mar 22, 2017 at 11:06 AM, Mark Johnson
>> >  wrote:
>> > > Is it possible to configure Solr to treat text that matches a regex as
>> a
>> > > phrase?
>> > >
>> > > I have a database full of products, and the Title and Description
>> fields
>> > > are text_en, tokenized via the StandardTokenizerFactory. This works in
>> > most
>> > > cases, but a number of products have names like:
>> > >
>> > >  - Vitamin A
>> > >  - Vitamin-A
>> > >  - Vitamin B12
>> > >  - Vitamin B-12
>> > > ...and so on
>> > >
>> > > I have a regex that will match all of the permutations and would like
>> to
>> > > configure the field type so that anything that matches the regex
>> pattern
>> > is
>> > > treated as a single token, instead of being broken up by spaces, etc.
>> Is
>> > > that possible?
>> > >
>> > > --
>> > > *This message is intended only for the use of the individual or entity
>> to
>> > > which it is addressed and may contain information that is privileged,
>> > > confidential and exempt from disclosure under applicable law. If you
>> have
>> > > received this message in error, you are hereby notified that any use,
>> > > dissemination, distribution or copying of this message is prohibited.
>> If
>> > > you have received this communication in error, please notify the sender
>> > > immediately and destroy the transmitted information.*
>> >
>>
>>
>>
>> --
>>
>> Best Regards,
>>
>> *Mark Johnson* | .NET Software Engineer
>>
>> Office: 603-392-7017
>>
>> Emerson Ecologics, LLC | 1230 Elm Street | Suite 301 | Manchester NH |
>> 03101
>>
>>   
>>
>> *Supporting The Practice Of Healthy Living*
>>
>>
>> --
>> *This message is intended only for the use of the individual or entity to
>> which it is addressed and may contain information that is privileged,
>> confidential and exempt from disclosure under applicable law. If you have
>> received this message in error, you are hereby notified that any use,
>> dissemination, distribution or copying of this message is prohibited. If
>> you have received this communication in error, please notify the sender
>> immediately and destroy the transmitted information.*
>>


RE: Exception while integrating openNLP with Solr

2017-03-22 Thread aruninfo100
Hi,

I applied the LUCENE-2899.patch, which provides OpenNLP capabilities to
Solr. One such feature is lemmatization, which helps match the root form of
a word. But integrating it made indexing far too time consuming. The patch
also provides POS tagging, sentence detection and named entity recognition.
As you said, here too the models have to be trained for better performance.

I am also trying to use  POS-tagging:

 

I tried analyzing the output of this filter from the Solr admin UI and I
could see the tagging.
I haven't trained the model (en-pos-maxent.bin) as of now.

It would be helpful if you could provide details on the following, so I can
build on the knowledge you have shared:

1. How good the training data should be, and what to look out for.
2. The training tool you have used. OpenNLP provides a command line
interface for training as well as APIs.
3. The schema structure to follow.
4. The query structure.

Thanks once again for spending time on my queries :) .

Thanks and Regards,
Arun



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Exception-while-integrating-openNLP-with-Solr-tp4326146p4326387.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SolrJ getHighlighting() does not return results in order

2017-03-22 Thread leoperezpulido
Hello,


Yes, getHighlighting() returns a Map<String, Map<String, List<String>>>, so I
first get the map = response.getHighlighting();


Then I initialize a TreeMap with the map object just obtained above (new 
TreeMap<>(map)). I then get a collection-view of this treeMap object, like: set 
= treeMap.entrySet();


To obtain the order that corresponds with the order returned from a 
SolrDocumentList I initialized an ArrayList dynamically, as in:


ArrayList<String> highlightResults = new ArrayList<>();

for (Map.Entry<String, Map<String, List<String>>> me : set) {

highlightResults.add(me.getValue().get("content").get(0));

}


And with that I obtained the desired results.


Thanks for your help.


Leonardo.




From: Bryan Bende [via Lucene] 
Sent: 22 March 2017 17:44:18
To: leoperezpulido
Subject: Re: SolrJ getHighlighting() does not return results in order

Hello,

I believe getHighlighting() returns Map<String, Map<String, List<String>>>.

Generally Maps are not expected to iterate in order unless you know
the underlying implementation of the Map, for example LinkedHashMap
will iterate in the insertion order and HashMap will not.

You should be able to take the doc id from one of the results in the
document list and then do getHighlighting().get(docid) to get the
Map<String, List<String>> for the given
document.

Hope that helps.

-Bryan


On Wed, Mar 22, 2017 at 8:54 AM, leoperezpulido
<[hidden email]> wrote:

> Hi,
>
> Implementing highlighting with *SolrJ* does not return results in the proper
> order while I "page" through results. This does not seem to be a problem with
> the RESTful API.
>
> // ...
> query.setQuery("text");
> /*
> The problem is when I set start to get different "pages",
> the results returned by getHighlighting() are disordered.
> */
> query.setStart(0);
> query.setSort("score", SolrQuery.ORDER.desc);
> query.setIncludeScore(true);
>
> query.setHighlight(true);
> query.addHighlightField("content");
> // ...
>
> Take the example of a simple index with a field named content and field's
> values like:
> Document 1
> Document 2
> Document 3
> etc.
>
> With the results returned by SolrDocumentList and with the RESTful API, I
> can paginate in the normal way, and the results remain ordered. This is not
> the case when I get results from getHighlighting().
>
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/SolrJ-getHighlighting-does-not-return-results-in-order-tp4326218.html
> Sent from the Solr - User mailing list archive at Nabble.com.







--
View this message in context: 
http://lucene.472066.n3.nabble.com/SolrJ-getHighlighting-does-not-return-results-in-order-tp4326218p4326388.html
Sent from the Solr - User mailing list archive at Nabble.com.

Re: block join - search together at parent and childern

2017-03-22 Thread Jan Nekuda
Hi Mikhail,
thank you very much - it's exactly what I need. When I first tried it I had
a problem with the spaces and it seemed that it didn't work, but now it
works great.
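
For reference, a rough SolrJ sketch of the query shape suggested in the
quoted message below (the field lists and the type:car parent filter are
placeholders taken from earlier in this thread; exception handling omitted):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.response.QueryResponse;

// solrClient is an existing SolrClient pointed at the collection.
SolrQuery q = new SolrQuery();
// One required clause per query word: the word must match either on the
// parent fields or, through the block join, on some child document.
q.setQuery("+({!edismax qf=$pflds v=$w1} {!parent which=type:car}{!edismax qf=$cflds v=$w1})"
         + " +({!edismax qf=$pflds v=$w2} {!parent which=type:car}{!edismax qf=$cflds v=$w2})");
q.set("pflds", "color first_country name");  // parent-level fields
q.set("cflds", "power name country");        // child-level fields
q.set("w1", "seat");
q.set("w2", "63KW");
QueryResponse rsp = solrClient.query(q);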

Thanks and have a nice day
Jan



2017-03-21 10:11 GMT+01:00 Mikhail Khludnev :

> Hello Jan,
> If I get you right, you need to find every word either in parent or child
> level, hence:
>
> q=+({!edismax qf=$pflds v=$w1} {!parent ..}{!edismax qf=$cflds v=$w1})
> +({!edismax qf=$pflds v=$w2} {!parent ..}{!edismax qf=$cflds
> v=$w2})...&w1=foo&w2=bar
> note that spaces and + matter much. This yields cross-matches, but you
> probably don't bother about them.
>
> On Sun, Mar 19, 2017 at 11:58 AM, Jan Nekuda  wrote:
>
> > Hi Michael,
> > thank you for fast answer - I have tried it, but it's not exactly what I
> > need. I hope that I understood it good - the problem is that if I will
> > write foo bar and foo bar is not found in root entity then it returns
> > nothing even if any field in children contains foo bar.
> > I need to write foo bar and find all documents where foo bar exists in
> > document A OR B OR C OR D even if in A will have FOO and in e.g C will be
> > bar. But if I will write bar of chocolate then I need return nothing.
> >
> > my idea was to use
> > edismax and filter query for each word:
> > http://localhost:8983/solr/demo/select?q=*:*&fq={!parent
> > which=type:root}foo*&fq={!parent
> > which=typ:root}bar*&wt=json&indent=true&defType=edismax&
> > qf=$allfields&stopwords=true&lowercaseOperators=true&allfieldscolor,
> > first_country, power, name, country
> >
> > the problem is that I'm not able to find also parent documents in one
> > condition with children.
> >
> > How I wrote I'm able solve it with another parent and then also doc A
> will
> > be child and everything will work fine - but I would like to solve it
> > better.
> >
> >
> > Do you have or someone else another idea?:)
> >
> > Thanks
> > Jan
> >
> >
> > 2017-03-16 21:51 GMT+01:00 Mikhail Khludnev :
> >
> > > Hello Jan,
> > >
> > > What if you combine child and parent dismaxes like below
> > > q={!edismax qf=$parentfields}foo bar {!parent ..}{!dismax
> qf=$childfields
> > > v=$childclauses}&childclauses=foo bar +type:child&parentfields=...&
> > > parentfields=...
> > >
> > > On Thu, Mar 16, 2017 at 10:54 PM, Jan Nekuda 
> > wrote:
> > >
> > > > Hello Mikhail,
> > > >
> > > > thanks for fast answer. The problem is, that I want to have the
> dismax
> > on
> > > > child and parent together - to have the filter evaluated together.
> > > >
> > > > I need to have documents:
> > > >
> > > >
> > > > path: car
> > > >
> > > > type:car
> > > >
> > > > color:red
> > > >
> > > > first_country: CZ
> > > >
> > > > name:seat
> > > >
> > > >
> > > >
> > > > path: car\engine
> > > >
> > > > type:engine
> > > >
> > > > power:63KW
> > > >
> > > >
> > > >
> > > > path: car\engine\manufacturer
> > > >
> > > > type:manufacturer
> > > >
> > > > name: xx
> > > >
> > > > country:PL
> > > >
> > > >
> > > > path: car
> > > >
> > > > type:car
> > > >
> > > > color:green
> > > >
> > > > first_country: CZ
> > > >
> > > > name:skoda
> > > >
> > > >
> > > >
> > > > path: car\engine
> > > >
> > > > type:engine
> > > >
> > > > power:88KW
> > > >
> > > >
> > > >
> > > > path: car\engine\manufacturer
> > > >
> > > > type:manufacturer
> > > >
> > > > name: yy
> > > >
> > > > country:PL
> > > >
> > > >
> > > > where car is parent document engine is its child a manufacturer is
> > child
> > > > of engine and the structure can be deep.
> > > >
> > > > I need to make a query with edismax over fields color, first_country,
> > > > power, name, country over parent and all childern.
> > > >
> > > > when I ask then "seat 63 kw" i need to get seat car
> > > >
> > > > the same if I will write only "seat" or only "63kw" or only "xx"
> > > >
> > > > but if I will write "seat 88kw" i expect that i will get no result
> > > >
> > > > I need to return parents in which tree are all the words which I
> wrote
> > to
> > > > query.
> > > >
> > > > How I wrote before my solution was to split the query text and use
> > q:*:*
> > > > and for each /word/ in query make
> > > >
> > > > fq={!parent which=type:car}/word//
> > > > /
> > > >
> > > > //and edismax with qf=color, first_country, power, name, country
> > > >
> > > > Thank you for your time:)
> > > >
> > > > Jan
> > > >
> > > >
> > > > Dne 16.03.2017 v 20:00 Mikhail Khludnev napsal(a):
> > > >
> > > >
> > > > Hello,
> > > >>
> > > >> It's hard to get into the problem. but you probably want to have
> > dismax
> > > on
> > > >> child level:
> > > >> q={!parent ...}{!edismax qf='childF1 childF2' v=$chq}&chq=foo bar
> > > >> It's usually broken because child query might match parents which is
> > not
> > > >> allowed. Thus, it's probably can solved by adding +type:child into
> > chq.
> > > >> IIRC edismax supports lucene syntax.
> > > >>
> > > >> On Thu, Mar 16, 2017 at 4:47 PM, Jan Nekuda 
> > > wrote:
> > > >>
> > > >> Hi,
> > > >>> I have a question for which I wasn'

to handle expired documents: collection alias or delete by id query

2017-03-22 Thread Derek Poh

Hi

I have collections of products. I am indexing 3-4 times daily.
Every day there are products that expire, and I need to remove them from
these collections daily.


I can think of 2 ways to do this.
1. Using a collection alias to switch between a main and a temp collection:
- clear and index the temp collection
- create alias to temp collection.
- clear and index the main collection.
- create alias to main collection.

This way requires additional collections.

2. Get the list of expired products and generate delete-by-id queries to the
collections.
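
For option 2, I am thinking of something roughly like the following in SolrJ
(just a sketch; loadExpiredProductIds() is a placeholder for however the
expired ids are fetched from our product database, and "products" stands in
for the real collection name):

// solrClient is an existing SolrClient (e.g. a CloudSolrClient) for the cluster.
List<String> expiredIds = loadExpiredProductIds();
if (!expiredIds.isEmpty()) {
    solrClient.deleteById("products", expiredIds);  // delete by id, no query needed
    solrClient.commit("products");
}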


I would like to get some advice on which way I should adopt.


Derek

--
CONFIDENTIALITY NOTICE 

This e-mail (including any attachments) may contain confidential and/or privileged information. If you are not the intended recipient or have received this e-mail in error, please inform the sender immediately and delete this e-mail (including any attachments) from your computer, and you must not use, disclose to anyone else or copy this e-mail (including any attachments), whether in whole or in part. 


This e-mail and any reply to it may be monitored for security, legal, 
regulatory compliance and/or other appropriate reasons.