AW: AW: SolrClient#updateByQuery?
Thanks for all these (main contributor's 😉) valuable inputs!

First thing I did was getting rid of "expungeDeletes". My "single-deletion" unittest failed until I added the optimize-param
> updateRequest.setParam( "optimize", "true" );
Does this make sense or should I JIRA it? How expensive is this "optimization"?

BTW: we are on Solr 6.6.0

-----Original Message-----
From: Clemens Wyss DEV [mailto:clemens...@mysign.ch]
Sent: Saturday, 27 January 2018 08:50
To: 'solr-user@lucene.apache.org'
Subject: AW: AW: SolrClient#updateByQuery?

Thanks for all these (main contributor's 😉) valuable inputs!

First thing I did was getting rid of "expungeDeletes". My "single-deletion" unittest failed until I added the optimize-param
> updateRequest.setParam( "optimize", "true" );
Does this make sense or should I JIRA it? How expensive is this "optimization"?

-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Saturday, 27 January 2018 00:49
To: solr-user@lucene.apache.org
Subject: Re: AW: SolrClient#updateByQuery?

On 1/26/2018 9:55 AM, Clemens Wyss DEV wrote:
> Why do I want to do all this (dumb things)? The context is as follows:
> when a document is deleted in an index/core, this deletion is not immediately
> reflected in the search results. Deletions are not really NRT (or has this
> changed?). Till now we "solved" this brutely by forcing a commit (with
> "expunge deletes"), till we noticed that this results in quite a "heavy
> load", to say the least.
> Now I have the idea to add a "deleted" flag to all the documents that is
> filtered on in all queries.
> When it comes to deletions, I would update the document's deleted flag and
> then effectively delete it. For single deletions this is ok, but what if I
> need to re-index?

The deleteByQuery functionality is known to have some issues getting along with other things happening at the same time. For best performance and compatibility with concurrent operations, I would strongly recommend that you change all deleteByQuery calls into two steps: Do a standard query with fl=id (or whatever your uniqueKey field is), gather up the ID values (possibly with start/rows pagination or cursorMark), and then proceed to do one or more deleteById calls with those ID values. Both the query and the ID-based delete can coexist with other concurrent operations very well.

I would expect that doing atomic updates to a deleted field in your documents is going to be slower than the query/deleteById approach. I cannot be sure this is the case, but I think it would be. It should be a lot more friendly to NRT operation than deleteByQuery.

As Walter said, expungeDeletes will result in Solr doing a lot more work than it should, slowing things down even more. It also won't affect search results at all. Once the commit finishes and opens a new searcher, Solr will not include deleted documents in search results. The expungeDeletes parameter can make commits take a VERY long time.

I have no idea whether the issues surrounding deleteByQuery can be fixed or not.

Thanks,
Shawn
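A minimal SolrJ sketch of the two-step query/deleteById approach Shawn describes above. The collection name, query string, and page size are illustrative assumptions; error handling is omitted:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.params.CursorMarkParams;

    public class DeleteByQueryAsDeleteById {
      // Collect the uniqueKey values matching the query, then delete by ID.
      static void deleteMatching(SolrClient client, String collection, String query) throws Exception {
        SolrQuery q = new SolrQuery(query);
        q.setFields("id");                          // only fetch the uniqueKey
        q.setRows(1000);                            // page size - an arbitrary choice
        q.setSort(SolrQuery.SortClause.asc("id"));  // cursorMark requires a sort on the uniqueKey

        List<String> ids = new ArrayList<>();
        String cursor = CursorMarkParams.CURSOR_MARK_START;
        while (true) {
          q.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
          QueryResponse rsp = client.query(collection, q);
          for (SolrDocument doc : rsp.getResults()) {
            ids.add((String) doc.getFieldValue("id"));
          }
          String next = rsp.getNextCursorMark();
          if (cursor.equals(next)) break;           // no more pages
          cursor = next;
        }

        if (!ids.isEmpty()) {
          client.deleteById(collection, ids);       // ID-based delete coexists well with concurrent updates
          client.commit(collection);
        }
      }
    }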
RE: 7.2.1 cluster dies within minutes after restart
Hello,

I grepped for it yesterday and found nothing but 3 in the settings, but judging from the weird timeout value, you may be right. Let me apply your patch early next week and check for spurious warnings.

Another noteworthy observation for those working on cloud stability and recovery: whenever this happens, some nodes are also absolutely sure to run OOM. The leaders usually live longest, the replicas don't; their heap usage peaks every time, consistently.

Thanks,
Markus

-----Original message-----
> From: Shawn Heisey
> Sent: Saturday 27th January 2018 0:49
> To: solr-user@lucene.apache.org
> Subject: Re: 7.2.1 cluster dies within minutes after restart
>
> On 1/26/2018 10:02 AM, Markus Jelsma wrote:
> > o.a.z.ClientCnxn Client session timed out, have not heard from server in
> > 22130ms (although zkClientTimeOut is 3).
>
> Are you absolutely certain that there is a setting for zkClientTimeout
> that is actually getting applied? The default value in Solr's example
> configs is 30 seconds, but the internal default in the code (when no
> configuration is found) is still 15. I have confirmed this in the code.
>
> Looks like SolrCloud doesn't log the values it's using for things like
> zkClientTimeout. I think it should.
>
> https://issues.apache.org/jira/browse/SOLR-11915
>
> Thanks,
> Shawn
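For reference, this is roughly where zkClientTimeout lives in the stock solr.xml that ships with Solr (a trimmed excerpt; the 30-second default Shawn mentions). Whether this value is actually being applied in the cluster above is exactly what is in question:

    <!-- solr.xml excerpt - SolrCloud section of the stock example configuration -->
    <solr>
      <solrcloud>
        <str name="host">${host:}</str>
        <int name="hostPort">${jetty.port:8983}</int>
        <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
      </solrcloud>
    </solr>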
Using replicas in SOLR-6.5.1
I use SOLR-6.5.1. I would like to use SolrCloud replicas, and I have some questions:

1) What is the best architecture for this if my collection contains 20 shards and each shard is on a different VM? 40 VMs, where 20 are for leaders and 20 for replicas? Or maybe stay with 20 VMs, where a leader and a replica (of another leader) share the same VM, but add RAM?

2) What are the open issues about replicas in SOLR-6.5.1 that I need to check?

3) If I use SolrCloud replicas, which configuration parameters should I change? Which can I change?

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: AW: AW: SolrClient#updateByQuery?
On 1/27/2018 12:49 AM, Clemens Wyss DEV wrote:
> Thanks for all these (main contributor's 😉) valuable inputs!
>
> First thing I did was getting rid of "expungeDeletes". My
> "single-deletion" unittest failed until I added the optimize-param
>> updateRequest.setParam( "optimize", "true" );
> Does this make sense or should I JIRA it?
> How expensive is this "optimization"?

An optimize operation is a complete rewrite of the entire index to one segment. It will typically double the size of the index. The rewritten index will not have any documents that were deleted in it. It's slow and extremely expensive. If the index is one gigabyte, expect an optimize to take at least half an hour, possibly longer, to complete. The CPU and disk I/O are going to take a beating while the optimize is occurring.

Thanks,
Shawn
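To make the cost difference concrete: if the goal is only to make deletions visible, a plain hard commit that opens a new searcher should suffice; the optimize that made the unit test pass is the far heavier call. A SolrJ sketch of the two operations, with an assumed core name and client setup:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class CommitVsOptimize {
      public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build();
        String core = "mycore";  // assumed core/collection name

        // Cheaper: a hard commit that opens a new searcher; once it returns,
        // deleted documents no longer appear in search results.
        client.commit(core, /* waitFlush */ true, /* waitSearcher */ true);

        // Much more expensive: rewrite the whole index down to one segment,
        // the same effect as the optimize=true parameter discussed above.
        client.optimize(core, /* waitFlush */ true, /* waitSearcher */ true, /* maxSegments */ 1);

        client.close();
      }
    }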
Re: Using replicas in SOLR-6.5.1
1. You could just have 2 VMs: one has all 20 shards of your collection, the other one has the replicas for those shards. In this scenario, if one VM is not available, you still have application availability, as at least one replica is available for each shard. This assumes that your VM can fit all the data (all 20 shards) without compromising on performance or getting into memory or garbage collection issues (I am not sure what the size of your collection or shards is). For additional redundancy, you can add another VM and add another replica for all your shards.

2. Can you provide more specifics around what sort of issues you are thinking of? Replication in general is pretty solid in the version you are talking about. You could comb through JIRA (https://issues.apache.org/jira/browse/SOLR-5821?jql=project%20%3D%20SOLR%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20text%20~%20%22replica%22).

3. I would recommend you take a look at the Solr Collections API (https://lucene.apache.org/solr/guide/6_6/collections-api.html). Parameters that you want to pay more attention to are "replicationFactor", "numShards" and "maxShardsPerNode", which relate to the shards and replicas. If you have a use case that warrants going beyond the above scenario of having all shards on the same VM, then you should read more into "maxShardsPerNode", etc. - but perhaps you can share a bit more around that use case. An example CREATE call using these parameters is sketched after this message.

Thanks,
--
Sameer Maggon
https://www.searchstax.com | Solr-as-a-Service platform on AWS, Azure and GCP

On Sat, Jan 27, 2018 at 2:08 AM, SOLR4189 wrote:
> I use SOLR-6.5.1. I would like to use SolrCloud replicas. And I have some
> questions:
>
> 1) What is the best architecture for this if my collection contains 20
> shards, and each shard is in a different vm? 40 vms where 20 for leaders and
> 20 for replicas? Or maybe stay with 20 vms where leader and replica (of
> another leader) in the same vm but to add RAM?
>
> 2) What are opened issues about replicas in SOLR-6.5.1 that I need to
> check?
>
> 3) If I use SolrCloud replica, which configuration parameters should I
> change? Which can I change?
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
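Following up on point 3 above, a sketch of creating a collection with those parameters via SolrJ's Collections API support. The ZooKeeper hosts, collection name, config set name and counts are illustrative assumptions, not a recommendation for this particular setup:

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.request.CollectionAdminRequest;

    public class CreateCollectionExample {
      public static void main(String[] args) throws Exception {
        // Assumed ZooKeeper ensemble; all values below are purely illustrative.
        CloudSolrClient client = new CloudSolrClient.Builder()
            .withZkHost("zk1:2181,zk2:2181,zk3:2181")
            .build();

        // numShards=20, replicationFactor=2, at most 2 shard replicas per node.
        CollectionAdminRequest.Create create =
            CollectionAdminRequest.createCollection("mycollection", "myconfig", 20, 2)
                .setMaxShardsPerNode(2);
        create.process(client);

        client.close();
      }
    }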
Re: Using replicas in SOLR-6.5.1
1. You are right; due to memory and garbage collection issues I put each shard on a different VM. So my VM has 50 GB RAM (10 GB for the JVM and 40 GB for the index) and it works well for my use case. Maybe I don't understand the Solr terms, but if you say to put 20 shards on one VM, what does that mean? 20 nodes, or 20 JVMs, or 20 Solr instances on the same virtual server? Can you explain what you meant?

2. I mean issues like "facet performance regression", or "using ltr with grouping", or "using timeAllowed with grouping" - something that would stop me from using the replica feature. Sometimes I don't understand Solr issues; for example, if a bug is unresolved, affects version 4.10 and the fix version is none, what does it mean? Can this bug happen in Solr-6.5.1 also?

3. Yes, I'm familiar with the Solr Collections API. I preferred to put each shard on a different small VM. Just to make sure with you: *one Solr node = one JVM = one Solr instance = one or many shards?*

Thank you.

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Re: ***UNCHECKED*** Limit Solr search to number of character/words (without changing index)
Thanks.

I do not want to search if the query is shorter than a certain number of terms/characters.

For example, I have a 10MB document indexed in Solr; what I want is to search the query in the first 1MB of content of that indexed document.

Any workaround, e.g. can I send a query to Solr to look for only 1MB from the start of the document?

On Fri, Jan 26, 2018 at 10:46 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) <dceccarel...@bloomberg.net> wrote:

> Hi Zahid, if you want to allow searching only if the query is shorter than
> a certain number of terms / characters, I would do it before calling solr
> probably, otherwise you could write a QueryParserPlugin (see [1]) and check
> that the query is sound before processing it.
> See also: http://coding-art.blogspot.co.uk/2016/05/writing-custom-solr-query-parser-for.html
>
> Cheers,
> Diego
>
> [1] https://wiki.apache.org/solr/SolrPlugins
>
>
> From: solr-user@lucene.apache.org At: 01/26/18 13:24:36 To: solr-user@lucene.apache.org
> Cc: apa...@elyograg.org
> Subject: ***UNCHECKED*** Limit Solr search to number of character/words (without changing index)
>
> Hi All,
>
> Is there any way I can restrict a Solr search query to look at only a specified
> number of characters/words (for searching purposes only, not for highlighting)?
>
> *For example:*
>
> *Indexed content:*
> *I am a man of my words I am a lazy man...*
>
> Search to consider only below mentioned (words=7 or characters=16)
> *I am a man of my words*
>
> If I search for *lazy*, no record should be found.
> If I search for *a*, 1 record should be found.
>
>
> Thanks
> Zahid Iqbal
AW: AW: AW: SolrClient#updateByQuery?
Erick said/wrote:
> If you commit after docs are deleted and _still_ see them in search results,
> that's a JIRA
should I JIRA it?

-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: Saturday, 27 January 2018 12:05
To: solr-user@lucene.apache.org
Subject: Re: AW: AW: SolrClient#updateByQuery?

On 1/27/2018 12:49 AM, Clemens Wyss DEV wrote:
> Thanks for all these (main contributor's 😉) valuable inputs!
>
> First thing I did was getting rid of "expungeDeletes". My
> "single-deletion" unittest failed until I added the optimize-param
>> updateRequest.setParam( "optimize", "true" );
> Does this make sense or should I JIRA it?
> How expensive is this "optimization"?

An optimize operation is a complete rewrite of the entire index to one segment. It will typically double the size of the index. The rewritten index will not have any documents that were deleted in it. It's slow and extremely expensive. If the index is one gigabyte, expect an optimize to take at least half an hour, possibly longer, to complete. The CPU and disk I/O are going to take a beating while the optimize is occurring.

Thanks,
Shawn
Re: AW: AW: SolrClient#updateByQuery?
Clemens:

Let's not raise a JIRA quite yet. I am 99% sure your test is not doing what you think, or you have some invalid expectations. This is such a fundamental feature that it'd surprise me a _lot_ if it were a bug. Also, there are a bunch of DeleteByQuery tests in the junit tests that run all the time.

Wait, are you issuing an explicit commit or not? I saw this phrase "...brutely by forcing a commit (with "expunge deletes")..." and saw the word "commit" and assumed you were issuing a commit, but re-reading, that's not clear at all. Code should look something like

update-via-delete-by-query
solrClient.commit();
query to see if doc is gone

So here's what I'd try next:

1> Issue an explicit commit command (SolrClient.commit()) after the DBQ. The defaults there are openSearcher=true and waitSearcher=true. When that returns, _then_ issue your query.

2> If that doesn't work, try (just for information gathering) waiting several seconds after the commit to try your request. This should _not_ be necessary, but it'll give us a clue what's going on.

3> Show us the code if you can.

Best,
Erick

On Sat, Jan 27, 2018 at 6:55 AM, Clemens Wyss DEV wrote:
> Erick said/wrote:
>> If you commit after docs are deleted and _still_ see them in search results,
>> that's a JIRA
> should I JIRA it?
>
> -----Original Message-----
> From: Shawn Heisey [mailto:apa...@elyograg.org]
> Sent: Saturday, 27 January 2018 12:05
> To: solr-user@lucene.apache.org
> Subject: Re: AW: AW: SolrClient#updateByQuery?
>
> On 1/27/2018 12:49 AM, Clemens Wyss DEV wrote:
>> Thanks for all these (main contributor's 😉) valuable inputs!
>>
>> First thing I did was getting rid of "expungeDeletes". My
>> "single-deletion" unittest failed until I added the optimize-param
>>> updateRequest.setParam( "optimize", "true" );
>> Does this make sense or should I JIRA it?
>> How expensive is this "optimization"?
>
> An optimize operation is a complete rewrite of the entire index to one segment. It will typically double the size of the index. The rewritten index will not have any documents that were deleted in it. It's slow and extremely expensive. If the index is one gigabyte, expect an optimize to take at least half an hour, possibly longer, to complete.
> The CPU and disk I/O are going to take a beating while the optimize is occurring.
>
> Thanks,
> Shawn
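A minimal SolrJ sketch of the delete / commit / verify sequence Erick outlines. The collection name, document ID, and client URL are assumptions for illustration only:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class DeleteThenCommitCheck {
      public static void main(String[] args) throws Exception {
        SolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr").build();
        String collection = "mycore";                 // assumed collection name

        // 1> delete, then explicitly commit (openSearcher/waitSearcher default to true)
        client.deleteByQuery(collection, "id:doc-to-delete");
        client.commit(collection);

        // 2> only after the commit returns, query to verify the doc is gone
        QueryResponse rsp = client.query(collection, new SolrQuery("id:doc-to-delete"));
        System.out.println("numFound after delete+commit: " + rsp.getResults().getNumFound());

        client.close();
      }
    }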
Re: ***UNCHECKED*** Limit Solr search to number of character/words (without changing index)
Sure, use TruncateFieldUpdateProcessorFactory in your update chain; here's the base definition:

<processor class="solr.TruncateFieldUpdateProcessorFactory">
  <str name="fieldName">trunc</str>
  <int name="maxLength">5</int>
</processor>

This _can_ be configured to operate on "all StrFields" or "all TextFields" as well, see the Javadocs. This is static, that is, the field is truncated at index time, so you can't change the values per-request.

Best,
Erick

On Sat, Jan 27, 2018 at 6:32 AM, Muhammad Zahid Iqbal wrote:
> Thanks.
>
> I do not want to search if the query is shorter than a certain number of
> terms/characters.
>
> For example, I have a 10MB document indexed in Solr; what I want is to
> search the query in the first 1MB of content of that indexed document.
>
> Any workaround, e.g. can I send a query to Solr to look for only 1MB from
> the start of the document?
>
>
>
> On Fri, Jan 26, 2018 at 10:46 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) <
> dceccarel...@bloomberg.net> wrote:
>
>> Hi Zahid, if you want to allow searching only if the query is shorter than
>> a certain number of terms / characters, I would do it before calling solr
>> probably, otherwise you could write a QueryParserPlugin (see [1]) and check
>> that the query is sound before processing it.
>> See also: http://coding-art.blogspot.co.uk/2016/05/writing-custom-solr-query-parser-for.html
>>
>> Cheers,
>> Diego
>>
>> [1] https://wiki.apache.org/solr/SolrPlugins
>>
>>
>> From: solr-user@lucene.apache.org At: 01/26/18 13:24:36 To: solr-user@lucene.apache.org
>> Cc: apa...@elyograg.org
>> Subject: ***UNCHECKED*** Limit Solr search to number of character/words (without changing index)
>>
>> Hi All,
>>
>> Is there any way I can restrict a Solr search query to look at only a specified
>> number of characters/words (for searching purposes only, not for highlighting)?
>>
>> *For example:*
>>
>> *Indexed content:*
>> *I am a man of my words I am a lazy man...*
>>
>> Search to consider only below mentioned (words=7 or characters=16)
>> *I am a man of my words*
>>
>> If I search for *lazy*, no record should be found.
>> If I search for *a*, 1 record should be found.
>>
>>
>> Thanks
>> Zahid Iqbal
HDFS replication factor
Hi,

when I configure my HDFS setup to use a specific replication factor, like 1, this only affects the index files that Solr writes. The write.lock files and backups are being created with a different replication factor. The reason for this should be that HdfsFileWriter is loading the defaults from the server (fileSystem.getServerDefaults(path)), while HdfsLockFactory and HdfsBackupRepository are simply using the client defaults, which seems to end up using a replication factor of 3 (and a block size of 128MB).

Is this known? If not, shall I open a JIRA for this?

regards,
Hendrik
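For context, a minimal sketch of where such a replication factor would typically be set on the client side, in the hdfs-site.xml that Solr's HdfsDirectoryFactory picks up via solr.hdfs.confdir (the path and value are assumptions); whether the lock factory and backup code honor this setting is exactly what is being questioned here:

    <!-- hdfs-site.xml, assumed to live in the directory pointed to by solr.hdfs.confdir -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
    </configuration>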
Facing issue while writing more than one DIH for a core.
Hi All,

Below are the DIH configurations for the data import handlers for a core.

*For DIH-1:*

<entity processor="XPathEntityProcessor"
        dataSource="URLDataSource"
        url="https://stackoverflow.com/feeds/tag/solr"
        forEach="/feed|/feed/entry"
        transformer="HTMLStripTransformer,RegexTransformer">
  <field name="dih_type" value="Feed"/>
</entity>

*For DIH-2:*

<entity url="http://127.0.0.1:9983/solr/briefs2"
        query="*:*"
        fl="id,title,lead,d_company,d_industry,d_location,d_created_on,d_updated_on">
  <field name="dih_type" value="Solr"/>
</entity>

*The problems I am facing are as follows:*

1. *I am not able to set a field without a column attribute:*
   *<field name="dih_type" value="Feed"/>*
   *Is there any other way to do this?*

2. *How can I set authentication details for both data import handlers?*

Regards,
Sanjeet.
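One common way to attach a constant value to every document produced by a DIH entity is the TemplateTransformer, which does require a column attribute. A sketch under that assumption (entity details trimmed, names illustrative; not necessarily the configuration being asked about):

    <entity name="feeds"
            processor="XPathEntityProcessor"
            dataSource="URLDataSource"
            url="https://stackoverflow.com/feeds/tag/solr"
            forEach="/feed|/feed/entry"
            transformer="HTMLStripTransformer,RegexTransformer,TemplateTransformer">
      <!-- TemplateTransformer writes the literal template value into the named column -->
      <field column="dih_type" template="Feed"/>
    </entity>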