AW: AW: SolrClient#updateByQuery?

2018-01-27 Thread Clemens Wyss DEV

Thanks for all these (main contributors' 😉) valuable inputs!

First thing I did was getting rid of "expungeDeletes". My "single-deletion" 
unit test failed until I added the optimize param
> updateRequest.setParam( "optimize", "true" );
Does this make sense, or should I JIRA it? 
How expensive is this "optimization"?
BTW: we are on Solr 6.6.0

-Original Message-
From: Clemens Wyss DEV [mailto:clemens...@mysign.ch] 
Sent: Saturday, January 27, 2018 08:50
To: 'solr-user@lucene.apache.org' 
Subject: AW: AW: SolrClient#updateByQuery?

Thanks for all these (main contributors' 😉) valuable inputs!

First thing I did was getting rid of "expungeDeletes". My 
"single-deletion" unit test failed until I added the optimize param
> updateRequest.setParam( "optimize", "true" );
Does this make sense, or should I JIRA it? 
How expensive is this "optimization"?


-Ursprüngliche Nachricht-
Von: Shawn Heisey [mailto:apa...@elyograg.org] 
Gesendet: Samstag, 27. Januar 2018 00:49
An: solr-user@lucene.apache.org
Betreff: Re: AW: SolrClient#updateByQuery?

On 1/26/2018 9:55 AM, Clemens Wyss DEV wrote:
> Why do I want to do all these (dumb) things? The context is as follows:
> when a document is deleted in an index/core this deletion is not immediately 
> reflected in the search results. Deletions are not really NRT (or has this 
> changed?). Till now we "solved" this brutally by forcing a commit (with 
> "expunge deletes"), till we noticed that this results in quite a "heavy 
> load", to say the least.
> Now I have the idea to add a "deleted" flag to all the documents, which is 
> filtered on in all queries.
> When it comes to deletions, I would update the document's deleted flag and 
> then effectively delete it. For single deletion this is ok, but what if I 
> need to re-index?

The deleteByQuery functionality is known to have some issues getting along with 
other things happening at the same time.

For best performance and compatibility with concurrent operations, I would 
strongly recommend that you change all deleteByQuery calls into two steps:  Do 
a standard query with fl=id (or whatever your uniqueKey field is), gather up 
the ID values (possibly with start/rows pagination or cursorMark), and then 
proceed to do one or more deleteById calls with those ID values.  Both the 
query and the ID-based delete can coexist with other concurrent operations very 
well.
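
Roughly, that two-step approach in SolrJ would look something like the
following (an untested sketch; the collection name, query string, and page
size below are just placeholders):

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.params.CursorMarkParams;

    public class QueryThenDeleteById {
      static void deleteMatching(SolrClient client) throws Exception {
        // the query is whatever you would otherwise have passed to deleteByQuery
        SolrQuery query = new SolrQuery("category:expired");
        query.setFields("id");                                  // only fetch the uniqueKey
        query.setRows(500);
        query.setSort(SolrQuery.SortClause.asc("id"));          // cursorMark needs a sort on the uniqueKey
        String cursor = CursorMarkParams.CURSOR_MARK_START;
        while (true) {
          query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
          QueryResponse rsp = client.query("mycollection", query);
          List<String> ids = new ArrayList<>();
          for (SolrDocument doc : rsp.getResults()) {
            ids.add(doc.getFieldValue("id").toString());
          }
          if (!ids.isEmpty()) {
            client.deleteById("mycollection", ids);             // ID-based delete instead of deleteByQuery
          }
          String next = rsp.getNextCursorMark();
          if (next.equals(cursor)) {
            break;                                              // cursor did not advance: no more pages
          }
          cursor = next;
        }
        client.commit("mycollection");
      }
    }

Batching the IDs page by page like this also keeps each deleteById request
reasonably small.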

I would expect that doing atomic updates to a deleted field in your documents 
is going to be slower than the query/deleteById approach.  I cannot be sure 
this is the case, but I think it would be.  It should be a lot more friendly to 
NRT operation than deleteByQuery.
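
(For reference, such an atomic update of the flag would look roughly like
this in SolrJ - again an untested sketch with placeholder collection and
field names:)

    import java.util.Collections;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class MarkDeletedExample {
      static void markDeleted(SolrClient client, String id) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", id);
        doc.addField("deleted", Collections.singletonMap("set", true));  // atomic "set" update
        client.add("mycollection", doc);
      }
    }

Queries would then need a filter such as fq=-deleted:true to hide those
documents until they are really removed.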

As Walter said, expungeDeletes will result in Solr doing a lot more work than 
it should, slowing things down even more.  It also won't affect search results 
at all.  Once the commit finishes and opens a new searcher, Solr will not 
include deleted documents in search results. The expungeDeletes parameter can 
make commits take a VERY long time.

I have no idea whether the issues surrounding deleteByQuery can be fixed or not.

Thanks,
Shawn



RE: 7.2.1 cluster dies within minutes after restart

2018-01-27 Thread Markus Jelsma
Hello,

I grepped for it yesterday and found nothing but 3 in the settings, but 
judging from the weird timeout value, you may be right. Let me apply your 
patch early next week and check for spurious warnings.

Another noteworthy observation for those working on cloud stability and 
recovery: whenever this happens, some nodes are also absolutely sure to run 
OOM. The leaders usually live longest, the replicas don't; their heap usage 
peaks every time, consistently. 

Thanks,
Markus
 
-Original message-
> From:Shawn Heisey 
> Sent: Saturday 27th January 2018 0:49
> To: solr-user@lucene.apache.org
> Subject: Re: 7.2.1 cluster dies within minutes after restart
> 
> On 1/26/2018 10:02 AM, Markus Jelsma wrote:
> > o.a.z.ClientCnxn Client session timed out, have not heard from server in 
> > 22130ms (although zkClientTimeOut is 3).
> 
> Are you absolutely certain that there is a setting for zkClientTimeout
> that is actually getting applied?  The default value in Solr's example
> configs is 30 seconds, but the internal default in the code (when no
> configuration is found) is still 15 seconds.  I have confirmed this in the code.
> 
> Looks like SolrCloud doesn't log the values it's using for things like
> zkClientTimeout.  I think it should.
> 
> https://issues.apache.org/jira/browse/SOLR-11915
> 
> Thanks,
> Shawn
> 
> 


Using replicas in SOLR-6.5.1

2018-01-27 Thread SOLR4189
I use SOLR-6.5.1. I would like to use SolrCloud replicas. And I have some
questions:

1) What is the best architecture for this if my collection contains 20
shards, and each shard is on a different VM? 40 VMs, where 20 are for leaders
and 20 for replicas? Or maybe stay with 20 VMs, where a leader and a replica
(of another leader) are on the same VM, but add RAM?

2) What are the open issues about replicas in SOLR-6.5.1 that I need to check?

3) If I use SolrCloud replicas, which configuration parameters should I
change? Which can I change?



--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: AW: AW: SolrClient#updateByQuery?

2018-01-27 Thread Shawn Heisey

On 1/27/2018 12:49 AM, Clemens Wyss DEV wrote:

Thanks for all these (main contributors' 😉) valuable inputs!

First thing I did was getting rid of "expungeDeletes". My 
"single-deletion" unit test failed until I added the optimize param

updateRequest.setParam( "optimize", "true" );

Does this make sense or should I JIRA it?
How expensive is this "optimization"?


An optimize operation is a complete rewrite of the entire index to one 
segment.  It will typically double the size of the index.  The rewritten 
index will not have any documents that were deleted in it.  It's slow 
and extremely expensive.  If the index is one gigabyte, expect an 
optimize to take at least half an hour, possibly longer, to complete. 
The CPU and disk I/O are going to take a beating while the optimize is 
occurring.


Thanks,
Shawn


Re: Using replicas in SOLR-6.5.1

2018-01-27 Thread Sameer Maggon
1. You could just have 2 VMs: one has all 20 shards of your collection, the
other one has the replicas for those shards. In this scenario, if one VM is
not available, you still have application availability, as at least one
replica is available for each shard. This assumes that one VM can fit all
the data (all 20 shards) without compromising on performance or
getting into memory or garbage collection issues (I am not sure what the
size of your collection or shards is). For additional redundancy, you can
add another VM and add another replica for all your shards.

2. Can you provide more specifics around what sort of issues you are
thinking of? Replication in general is pretty solid in the version you are
talking about. You could comb through JIRA (
https://issues.apache.org/jira/browse/SOLR-5821?jql=project%20%3D%20SOLR%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20text%20~%20%22replica%22
)

3. I would recommend you take a look at the Solr Collection API (
https://lucene.apache.org/solr/guide/6_6/collections-api.html). Parameters
that you want to pay more attention to are "replicationFactor", "numShards"
and "maxShardsPerNode" that relate to the shards and replicas.

If you have a use case that warrants going beyond the above scenario of
having all shards on the same VM, then you should read more into
"maxShardsPerNode", etc. - but perhaps you can share a bit more around that
use case.
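
For example (tying back to point 3), creating such a collection from SolrJ
might look roughly like this (untested; the collection name, configset name,
and ZooKeeper address are placeholders):

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class CreateCollectionExample {
      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder()
            .withZkHost("zk1:2181,zk2:2181,zk3:2181").build()) {
          // 20 shards, replicationFactor=2 (a leader plus one replica per shard),
          // at most 2 shards of this collection on any one node
          CollectionAdminRequest.Create create =
              CollectionAdminRequest.createCollection("mycollection", "myconfig", 20, 2);
          create.setMaxShardsPerNode(2);
          create.process(client);
        }
      }
    }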

Thanks,
-- 
Sameer Maggon
https://www.searchstax.com | Solr-as-a-Service platform on AWS, Azure and
GCP

On Sat, Jan 27, 2018 at 2:08 AM, SOLR4189  wrote:

> I use SOLR-6.5.1. I would like to use SolrCloud replicas. And I have some
> questions:
>
> 1) What is the best architecture for this if my collection contains 20
> shards, and each shard is on a different VM? 40 VMs, where 20 are for leaders
> and 20 for replicas? Or maybe stay with 20 VMs, where a leader and a replica
> (of another leader) are on the same VM, but add RAM?
>
> 2) What are the open issues about replicas in SOLR-6.5.1 that I need to
> check?
>
> 3) If I use SolrCloud replicas, which configuration parameters should I
> change? Which can I change?
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: Using replicas in SOLR-6.5.1

2018-01-27 Thread SOLR4189
1. You are right: due to memory and garbage collection issues I put each
shard on a different VM. So each of my VMs has 50 GB RAM (10 GB for the JVM
and 40 GB for the index) and it works well for my use case. Maybe I don't
understand the Solr terms, but if you say to put 20 shards on one VM, what
does that mean? 20 nodes, 20 JVMs, or 20 Solr instances on the same virtual
server? Can you explain what you meant?

2. I mean issues like "facet performance regression", "using ltr
with grouping" or "using timeAllowed with grouping" - something that would
stop me from using the replicas feature. Sometimes I don't understand Solr
issues; for example, if a bug is unresolved, affects version 4.10, and has no
fix version, what does that mean? Can this bug also happen in Solr 6.5.1?

3. Yes, I'm familiar with the Solr Collection API.

I preferred to put each shard on a different small VM. 

Just to make sure I understand you: one Solr node = one JVM = one Solr
instance = one or many shards?

Thank you.




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Re: ***UNCHECKED*** Limit Solr search to number of character/words (without changing index)

2018-01-27 Thread Muhammad Zahid Iqbal
Thanks.

I do not want to limit searching based on the query being shorter than a
certain number of terms/characters.

For example, I have a 10MB document indexed in Solr; what I want is to
search the query in only the first 1MB of content of that indexed document.

Is there any workaround, e.g. can I send a query to Solr to look at only
the first 1MB of the document?



On Fri, Jan 26, 2018 at 10:46 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) <
dceccarel...@bloomberg.net> wrote:

> Hi Zahid, if you want to allow searching only if the query is shorter than
> a certain number of terms / characters, I would do it before calling solr
> probably, otherwise you could write a QueryParserPlugin (see [1]) and check
> that the query is sound before processing it.
> See also: http://coding-art.blogspot.co.uk/2016/05/writing-custom-
> solr-query-parser-for.html
>
> Cheers,
> Diego
>
> [1] https://wiki.apache.org/solr/SolrPlugins
>
>
> From: solr-user@lucene.apache.org At: 01/26/18 13:24:36 To:
> solr-user@lucene.apache.org
> Cc:  apa...@elyograg.org
> Subject: ***UNCHECKED*** Limit Solr search to number of character/words
> (without changing index)
>
> Hi All,
>
> Is there any way I can restrict a Solr search query to look at only a
> specified number of characters/words (for searching purposes only, not for
> highlighting)?
>
> *For example:*
>
> *Indexed content:*
> *I am a man of my words I am a lazy man...*
>
> The search should consider only the text below (words=7 or characters=16):
> *I am a man of my words*
>
> If I search for *lazy*, no record should be found.
> If I search for *a*, 1 record should be found.
>
>
> Thanks
> Zahid Iqbal
>
>
>


AW: AW: AW: SolrClient#updateByQuery?

2018-01-27 Thread Clemens Wyss DEV
Erick said/wrote:
> If you commit after docs are deleted and _still_ see them in search results, 
> that's a JIRA
should I JIRA it?

-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: Saturday, January 27, 2018 12:05
To: solr-user@lucene.apache.org
Subject: Re: AW: AW: SolrClient#updateByQuery?

On 1/27/2018 12:49 AM, Clemens Wyss DEV wrote:
> Thanks for all these (main contributors' 😉) valuable inputs!
> 
> First thing I did was getting rid of "expungeDeletes". My 
> "single-deletion" unit test failed until I added the optimize param
>> updateRequest.setParam( "optimize", "true" );
> Does this make sense or should I JIRA it?
> How expensive is this "optimization"?

An optimize operation is a complete rewrite of the entire index to one 
segment.  It will typically double the size of the index.  The rewritten 
index will not have any documents that were deleted in it.  It's slow and 
extremely expensive.  If the index is one gigabyte, expect an optimize to 
take at least half an hour, possibly longer, to complete. 
The CPU and disk I/O are going to take a beating while the optimize is 
occurring.

Thanks,
Shawn


Re: AW: AW: SolrClient#updateByQuery?

2018-01-27 Thread Erick Erickson
Clemens:

Let's not raise a JIRA quite yet. I am 99% sure your test is not doing
what you think, or you have some invalid expectations. This is such a
fundamental feature that it'd surprise me a _lot_ if it were a bug.
Also, there are a bunch of DeleteByQuery tests in the junit tests
that are run all the time.

Wait, are you issuing an explicit commit or not? I saw this phrase
"...brutally by forcing a commit (with "expunge deletes")..." and saw
the word "commit" and assumed you were issuing a commit, but
re-reading it, that's not clear at all. Code should look something like:

update-via-delete-by-query
solrClient.commit();
query to see if doc is gone.

So here's what I'd try next:

1> Issue an explicit commit command (SolrClient.commit()) after the
DBQ. The defaults there are openSearcher=true and waitSearcher=
true. When that returns, _then_ issue your query.
2> If that doesn't work, try (just for information gathering) waiting
several seconds after the commit before trying your request. This should
_not_ be necessary, but it'll give us a clue about what's going on.
3> Show us the code if you can.
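
A rough SolrJ sketch of 1> combined with the verification query (untested;
the collection name and document id are placeholders):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class DeleteThenVerify {
      static void deleteAndVerify(SolrClient client) throws Exception {
        client.deleteByQuery("mycollection", "id:42");   // the delete-by-query
        client.commit("mycollection");                   // openSearcher=true, waitSearcher=true by default
        QueryResponse rsp = client.query("mycollection", new SolrQuery("id:42"));
        System.out.println("docs found after commit: " + rsp.getResults().getNumFound());
      }
    }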

Best,
Erick


On Sat, Jan 27, 2018 at 6:55 AM, Clemens Wyss DEV  wrote:
> Erick said/wrote:
>> If you commit after docs are deleted and _still_ see them in search results, 
>> that's a JIRA
> should I JIRA it?
>
> -Original Message-
> From: Shawn Heisey [mailto:apa...@elyograg.org]
> Sent: Saturday, January 27, 2018 12:05
> To: solr-user@lucene.apache.org
> Subject: Re: AW: AW: SolrClient#updateByQuery?
>
> On 1/27/2018 12:49 AM, Clemens Wyss DEV wrote:
>> Thanks for all these (main contributors' 😉) valuable inputs!
>>
>> First thing I did was getting rid of "expungeDeletes". My
>> "single-deletion" unit test failed until I added the optimize param
>>> updateRequest.setParam( "optimize", "true" );
>> Does this make sense or should I JIRA it?
>> How expensive is this "optimization"?
>
> An optimize operation is a complete rewrite of the entire index to one 
> segment.  It will typically double the size of the index.  The rewritten 
> index will not have any documents that were deleted in it.  It's slow and 
> extremely expensive.  If the index is one gigabyte, expect an optimize to 
> take at least half an hour, possibly longer, to complete.
> The CPU and disk I/O are going to take a beating while the optimize is 
> occurring.
>
> Thanks,
> Shawn


Re: ***UNCHECKED*** Limit Solr search to number of character/words (without changing index)

2018-01-27 Thread Erick Erickson
Sure, use TruncateFieldUpdateProcessorFactory in your update chain,
here's the base definition:

  <processor class="solr.TruncateFieldUpdateProcessorFactory">
    <str name="fieldName">trunc</str>
    <int name="maxLength">5</int>
  </processor>

This _can_ be configured to operate on "all StrFields" or "all
TextFields" as well; see the Javadocs.

This is static; that is, the field is truncated at index time, so you
can't change the values per request.
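
If the factory is wired into a named update chain (the chain name "truncate"
below is only a placeholder), the chain can be selected explicitly on the
update request - a rough, untested SolrJ sketch:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.request.UpdateRequest;
    import org.apache.solr.common.SolrInputDocument;

    public class TruncateChainExample {
      static void indexWithTruncation(SolrClient client) throws Exception {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");
        doc.addField("trunc", "only the first five characters of this will survive");
        UpdateRequest req = new UpdateRequest();
        req.add(doc);
        req.setParam("update.chain", "truncate");   // select the chain containing the factory
        req.process(client, "mycollection");
        client.commit("mycollection");
      }
    }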

Best,
Erick



On Sat, Jan 27, 2018 at 6:32 AM, Muhammad Zahid Iqbal
 wrote:
> Thanks.
>
> I do not want to limit searching based on the query being shorter than a
> certain number of terms/characters.
>
> For example, I have a 10MB document indexed in Solr; what I want is to
> search the query in only the first 1MB of content of that indexed document.
>
> Is there any workaround, e.g. can I send a query to Solr to look at only
> the first 1MB of the document?
>
>
>
> On Fri, Jan 26, 2018 at 10:46 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) <
> dceccarel...@bloomberg.net> wrote:
>
>> Hi Zahid, if you want to allow searching only if the query is shorter than
>> a certain number of terms / characters, I would do it before calling solr
>> probably, otherwise you could write a QueryParserPlugin (see [1]) and check
>> that the query is sound before processing it.
>> See also: http://coding-art.blogspot.co.uk/2016/05/writing-custom-
>> solr-query-parser-for.html
>>
>> Cheers,
>> Diego
>>
>> [1] https://wiki.apache.org/solr/SolrPlugins
>>
>>
>> From: solr-user@lucene.apache.org At: 01/26/18 13:24:36 To:
>> solr-user@lucene.apache.org
>> Cc:  apa...@elyograg.org
>> Subject: ***UNCHECKED*** Limit Solr search to number of character/words
>> (without changing index)
>>
>> Hi All,
>>
>> Is there any way I can restrict a Solr search query to look at only a
>> specified number of characters/words (for searching purposes only, not for
>> highlighting)?
>>
>> *For example:*
>>
>> *Indexed content:*
>> *I am a man of my words I am a lazy man...*
>>
>> The search should consider only the text below (words=7 or characters=16):
>> *I am a man of my words*
>>
>> If I search for *lazy*, no record should be found.
>> If I search for *a*, 1 record should be found.
>>
>>
>> Thanks
>> Zahid Iqbal
>>
>>
>>


HDFS replication factor

2018-01-27 Thread Hendrik Haddorp

Hi,

when I configure my HDFS setup to use a specific replication factor, 
like 1, this only affects the index files that Solr writes. The 
write.lock files and backups are being created with a different 
replication factor. The reason for this should be that HdfsFileWriter is 
loading the defaults from the server 
(fileSystem.getServerDefaults(path)) while HdfsLockFactory and 
HdfsBackupRepository are simply using the defaults, which seems to end up 
using a replication factor of 3 (and a block size of 128MB). Is this 
known? If not, shall I open a JIRA for this?
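
For illustration, the difference sketched with the plain Hadoop FileSystem
API (untested, and not the actual Solr code):

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FsServerDefaults;
    import org.apache.hadoop.fs.Path;

    public class HdfsReplicationSketch {
      static void write(FileSystem fileSystem, Path path) throws Exception {
        // what HdfsFileWriter does: ask the namenode for its defaults first
        FsServerDefaults serverDefaults = fileSystem.getServerDefaults(path);
        fileSystem.create(path, true,
            serverDefaults.getFileBufferSize(),
            serverDefaults.getReplication(),   // picks up the configured replication factor
            serverDefaults.getBlockSize()).close();

        // a plain create() falls back to built-in defaults, which is where the
        // replication factor of 3 and the 128MB block size come from
        fileSystem.create(path).close();
      }
    }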


regards,
Hendrik


Facing issue while writing more than one DIH for a core.

2018-01-27 Thread Sanjeet Kumar
Hi All,

Below are the DIH configurations for the data import handlers of a core.




*For DIH-1:*

  <entity url="https://stackoverflow.com/feeds/tag/solr"
          processor="XPathEntityProcessor"
          dataSource="URLDataSource"
          forEach="/feed|/feed/entry"
          transformer="HTMLStripTransformer,RegexTransformer">

    <!-- ... -->

    <field name="dih_type" value="Feed"/>

    <!-- ... -->

  </entity>

*For DIH-2:*

  <entity url="http://127.0.0.1:9983/solr/briefs2"
          query="*:*"
          fl="id,title,lead,d_company,d_industry,d_location,d_created_on,d_updated_on">

    <field name="dih_type" value="Solr"/>

    <!-- ... -->

  </entity>
*The problems I am facing are as follows:*

1. *I am not able to set a field without a column attribute:*

   <field name="dih_type" value="Feed"/>

   *Is there any other way to do this?*

2. *How can I set authentication details for both data import handlers?*


Regards,

Sanjeet.