Re: Multi threading indexing
A few years ago I provided a server-side concurrency "booster": https://issues.apache.org/jira/browse/SOLR-3585. But now I'd argue it is really a client-side (or ETL) responsibility.

On Mon, May 14, 2018 at 6:39 AM, Raymond Xie wrote:
> Hello,
>
> I have a huge amount of data (TB level) to be indexed. I am wondering if
> anyone can share your idea/code to do the multithreading indexing?
>
> Sincerely yours,
>
> Raymond

--
Sincerely yours
Mikhail Khludnev
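For the client-side approach, the usual pattern is to split the document stream into batches and feed them to a pool of sender threads. A minimal Python sketch of that pattern follows; `send_batch` is a hypothetical stand-in for whatever actually POSTs a batch to Solr's update handler (SolrJ's ConcurrentUpdateSolrClient does essentially this internally):

```python
# Client-side multithreaded indexing sketch. Only the threading/batching
# pattern is real here; send_batch is a placeholder for an actual HTTP
# POST to http://host:8983/solr/<collection>/update.
from concurrent.futures import ThreadPoolExecutor


def chunks(docs, size):
    """Split the document list into fixed-size batches."""
    for i in range(0, len(docs), size):
        yield docs[i:i + size]


def send_batch(batch):
    # Placeholder: a real implementation would post the batch to Solr
    # and return the number of documents accepted.
    return len(batch)


def index_parallel(docs, batch_size=1000, workers=8):
    """Index documents with several concurrent batch senders.

    Returns the total number of documents sent.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sent = pool.map(send_batch, chunks(docs, batch_size))
        return sum(sent)
```

Tuning batch_size and workers against your Solr cluster's capacity is usually where the real throughput gains come from.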
Re: How to restart solr in docker?
This is what I needed to do for updating the solrconfig files from local to docker:

`sudo docker cp docker/solr/production/conf/solrconfig.xml solr:/opt/solr/server/solr/production/conf/solrconfig.xml`
`sudo docker restart solr`

For some reason this is not synced automatically, so I had to cp the changed configs.

--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
Techniques for Retrieving Hits
In order to allow users to retrieve the documents that match a query, I make use of the embedded Jetty container to provide file server functionality. To make this happen, I provide a symbolic link between the actual document archive, and the Jetty file server. This seems somewhat of a kludge, and I'm wondering if others have a better way to retrieve the desired documents? (I'm not too concerned about security because I use ssh port forwarding to connect to remote authenticated clients.)
Re: Async exceptions during distributed update
Adding some more context to my last email.

Solr: 6.6.3
2 nodes: 3 shards each
No replication.

Can someone answer the following questions:
1) Any ideas on why the following errors keep happening? AFAIK the StreamingSolrClients error is because of timeouts when connecting to other nodes. Async errors are also network related, as explained earlier in the thread by Emir. There were no network issues, but the error has come back and is filling up my logs.
2) Is anyone using Solr 6.6.3 in production, and what has their experience been so far?
3) Is there any good documentation or blog post that explains the inner workings of SolrCloud networking?

Thanks
Jay

org.apache.solr.update.StreamingSolrClients
org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException:
Async exception during distributed update

> On May 13, 2018, at 9:21 PM, Jay Potharaju wrote:
>
> Hi,
> I restarted both my solr servers but I am seeing the async error again. In
> older 5.x versions of SolrCloud, Solr would normally recover gracefully in
> case of network errors, but Solr 6.6.3 does not seem to be doing that. At
> this time I am doing only a small percentage of deleteByQuery operations;
> it's mostly indexing of documents.
> I have not noticed any network blip like last time. Any suggestions, or is
> anyone else also having the same issue on Solr 6.6.3?
>
> I am again seeing the following two errors back to back:
>
> ERROR org.apache.solr.update.StreamingSolrClients
> org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException:
> Async exception during distributed update: Read timed out
>
> Thanks
> Jay
>
>> On Wed, May 9, 2018 at 12:34 AM Emir Arnautović wrote:
>>
>> Hi Jay,
>> A network blip might be the cause, but also the consequence of this issue.
>> Maybe you can try avoiding DBQ while indexing and see if it is the cause.
>> You can do a thread dump on "the other" node and see if there are blocked
>> threads; that can give you more clues about what's going on.
>>
>> Thanks,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>>
>>> On 8 May 2018, at 17:53, Jay Potharaju wrote:
>>>
>>> Hi Emir,
>>> I was seeing this error as long as the indexing was running. Once I
>>> stopped the indexing the errors also stopped. Yes, we do monitor both
>>> hosts and Solr, but have not seen anything out of the ordinary except for
>>> a small network blip. In my experience Solr generally recovers after a
>>> network blip, and there are a few errors for the streaming solr
>>> client... but I have never seen this error before.
>>>
>>> Thanks
>>> Jay Potharaju
>>>
>>>> On Tue, May 8, 2018 at 12:56 AM, Emir Arnautović wrote:
>>>>
>>>> Hi Jay,
>>>> This is a low ingestion rate. What is the size of your index? What is
>>>> the heap size? I am guessing that this is not a huge index, so I am
>>>> leaning toward what Shawn mentioned - some combination of
>>>> DBQ/merge/commit/optimise that is blocking indexing. Though, it is
>>>> strange that it is happening only on one node if you are sending updates
>>>> randomly to both nodes. Do you monitor your hosts/Solr? Do you see
>>>> anything different at the time when timeouts happen?
>>>>
>>>> Thanks,
>>>> Emir
>>>>
>>>>> On 8 May 2018, at 03:23, Jay Potharaju wrote:
>>>>>
>>>>> I have about 3-5 updates per second.
>>>>>
>>>>>> On May 7, 2018, at 5:02 PM, Shawn Heisey wrote:
>>>>>>
>>>>>>> On 5/7/2018 5:05 PM, Jay Potharaju wrote:
>>>>>>> There are some deletes by query. I have not had any issues with DBQ;
>>>>>>> currently have 5.3 running in production.
>>>>>>
>>>>>> Here's the big problem with DBQ. Imagine this sequence of events with
>>>>>> these timestamps:
>>>>>>
>>>>>> 13:00:00: A commit for change visibility happens.
>>>>>> 13:00:00: A segment merge is triggered by the commit.
>>>>>> (It's a big merge that takes exactly 3 minutes.)
>>>>>> 13:00:05: A deleteByQuery is sent.
>>>>>> 13:00:15: An update to the index is sent.
>>>>>> 13:00:25: An update to the index is sent.
>>>>>> 13:00:35: An update to the index is sent.
>>>>>> 13:00:45: An update to the index is sent.
>>>>>> 13:00:55: An update to the index is sent.
>>>>>> 13:01:05: An update to the index is sent.
>>>>>> 13:01:15: An update to the index is sent.
>>>>>> 13:01:25: An update to the index is sent.
>>>>>> {time passes, more updates might be sent}
>>>>>> 13:03:00: The merge finishes.
>>>>>>
>>>>>> Here's what would happen in this scenario: The DBQ and all of the
>>>>>> update requests sent *after* the DBQ will block until the merge
>>>>>> finishes.
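Emir's suggestion of avoiding DBQ while indexing is often implemented as a query-then-delete-by-id loop, since delete-by-id does not block behind a segment merge the way deleteByQuery can. A minimal sketch of that control flow in Python; `search` and `delete_by_ids` are hypothetical stand-ins for real Solr client calls:

```python
def delete_by_query_safely(search, delete_by_ids, query, page_size=500):
    """Replace one deleteByQuery with a query + delete-by-id loop.

    search(query, rows) should return the ids of documents still
    matching the query; delete_by_ids(ids) deletes them by unique key.
    Only the control flow is shown; both callables are placeholders
    for real Solr client operations.
    """
    deleted = 0
    while True:
        ids = search(query, rows=page_size)  # ids still matching
        if not ids:
            break  # nothing left to delete
        delete_by_ids(ids)
        deleted += len(ids)
    return deleted
```

In real use you would re-run the query between rounds (as above) rather than paginate, because deletions shift the result pages under you.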
Re: Techniques for Retrieving Hits
On 5/14/2018 6:46 AM, Terry Steichen wrote:
> In order to allow users to retrieve the documents that match a query, I
> make use of the embedded Jetty container to provide file server
> functionality. To make this happen, I provide a symbolic link between
> the actual document archive, and the Jetty file server. This seems
> somewhat of a kludge, and I'm wondering if others have a better way to
> retrieve the desired documents? (I'm not too concerned about security
> because I use ssh port forwarding to connect to remote authenticated
> clients.)

This is not a recommended usage for the servlet container where Solr runs.

Solr is a search engine. It is not designed to be a data store, although some people do use it that way.

If systems running Solr clients want to access all the information for a document when the search results do not contain all the information, they should use what IS in the search results to access that data from the system where it is stored -- that could be a database, a file server, a webserver, or similar.

Thanks,
Shawn
Commit too slow?
Hi

After injecting 200 documents into our Solr server, the commit operation at the end of the process (using ConcurrentUpdateSolrClient) takes 10 minutes. Is that too slow?

Our auto-commit policy is the following:

<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
  <maxTime>15000</maxTime>
</autoSoftCommit>

Thanks!
Re: Commit too slow?
On 5/14/2018 11:29 AM, LOPEZ-CORTES Mariano-ext wrote:
> After injecting 200 documents into our Solr server, the commit
> operation at the end of the process (using ConcurrentUpdateSolrClient)
> takes 10 minutes. Is that too slow?

There is a wiki page discussing slow commits:

https://wiki.apache.org/solr/SolrPerformanceProblems#Slow_commits

Thanks,
Shawn
Re: Techniques for Retrieving Hits
Shawn,

As noted in my embedded comments below, I don't really see the problem you apparently do. Maybe I'm missing something important (which certainly wouldn't be the first - or last - time that happened).

I posted this note because I've not seen list comments pertaining to the job of actually locating and retrieving hitlist documents. My way "seems" to work, and it is quite simple and compact. I just threw it out seeking a sanity check from others.

Terry

On 05/14/2018 11:32 AM, Shawn Heisey wrote:
> On 5/14/2018 6:46 AM, Terry Steichen wrote:
>> In order to allow users to retrieve the documents that match a query, I
>> make use of the embedded Jetty container to provide file server
>> functionality. To make this happen, I provide a symbolic link between
>> the actual document archive, and the Jetty file server. This seems
>> somewhat of a kludge, and I'm wondering if others have a better way to
>> retrieve the desired documents? (I'm not too concerned about security
>> because I use ssh port forwarding to connect to remote authenticated
>> clients.)
>
> This is not a recommended usage for the servlet container where Solr
> runs.

But if the retrieval traffic is light, what's the problem?

> Solr is a search engine. It is not designed to be a data store,
> although some people do use it that way.

Perhaps I didn't explain it right, but I'm not using it as a data store (other than the fact that I keep the actual file repository on the same machine on which Solr runs). I've got plenty of storage, so that's not an issue, and, as I mentioned above, traffic is quite light.

> If systems running Solr clients want to access all the information for
> a document when the search results do not contain all the information,
> they should use what IS in the search results to access that data from
> the system where it is stored -- that could be a database, a file
> server, a webserver, or similar.

Perhaps I'm missing something, but search results cannot "contain all the information", can they? I use highlighting, but that's just showing a few snippets - not a substitute for the document itself.

> Thanks,
> Shawn
Re: Techniques for Retrieving Hits
On 5/14/2018 3:13 PM, Terry Steichen wrote:
> I posted this note because I've not seen list comments pertaining to the
> job of actually locating and retrieving hitlist documents.

How documents are retrieved will be highly dependent on your setup. Here's how things usually go:

If the original data came from a database, then the system where people do their searches should know how to talk to the database, and use information in the search results to look up the full original document in the database.

If the source data is on a file server, then the system where people do their searches will need to have the file server storage mounted. It will then use information in the search results to access the full original document. Ditto for any other kind of canonical data store with Solr as the search engine.

The system where searches are done will be implemented by you. It will be up to that system to handle any kind of security filtering for both Solr searches and document access.

Solr should not be exposed directly to end users. Most of the time, what's in Solr is not particularly sensitive ... but when Solr is exposed to people who cannot be trusted, those end users may be able to change or delete any data in Solr. They might also be able to send denial of service queries directly to Solr.

Thanks,
Shawn
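Shawn's point about using what IS in the search results can be made concrete: if each document stores a relative path to its source file, the search front end can resolve that path against the mounted archive. A minimal Python sketch, assuming a hypothetical stored field `path_s`; the containment check is the kind of basic security filtering Shawn mentions:

```python
from pathlib import Path


def fetch_original(doc, archive_root):
    """Resolve a Solr search hit to its source file on a mounted archive.

    `doc` is one result document; `path_s` is an assumed stored field
    holding the file's path relative to the archive root. The guard
    keeps a crafted path value from escaping the archive directory.
    """
    root = Path(archive_root).resolve()
    target = (root / doc["path_s"]).resolve()
    if root not in target.parents and target != root:
        raise PermissionError("path escapes the archive root")
    return target.read_bytes()
```

The same lookup works whether the canonical store is a local directory, an NFS mount, or a web server (swap the file read for an HTTP GET).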
[ANNOUNCE] Apache Solr 7.3.1 released
15 May 2018, Apache Solr™ 7.3.1 available

The Lucene PMC is pleased to announce the release of Apache Solr 7.3.1.

Solr is the popular, blazing fast, open source NoSQL search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search and analytics, rich document parsing, geospatial search, extensive REST APIs as well as parallel SQL. Solr is enterprise grade, secure and highly scalable, providing fault tolerant distributed search and indexing, and powers the search and navigation features of many of the world's largest internet sites.

This release includes 9 bug fixes since the 7.3.0 release. Some of the major fixes are:

* Deleting replicas sometimes fails and causes the replicas to exist in the down state
* Upgrade commons-fileupload dependency to 1.3.3 to address CVE-2016-131
* Do not allow absolute URIs for including other files in solrconfig.xml and schema parsing
* A successful restore collection should mark the shard state as active and not buffering

Furthermore, this release includes Apache Lucene 7.3.1, which includes 1 bug fix since the 7.3.0 release.

The release is available for immediate download at:
http://www.apache.org/dyn/closer.lua/lucene/solr/7.3.1

Please read CHANGES.txt for a detailed list of changes:
https://lucene.apache.org/solr/7_3_1/changes/Changes.html

Please report any feedback to the mailing lists (http://lucene.apache.org/solr/discussion.html).

Note: The Apache Software Foundation uses an extensive mirroring network for distributing releases. It is possible that the mirror you are using may not have replicated the release yet. If that is the case, please try another mirror. This also goes for Maven access.
Re: question about updates to shard leaders only
OK, I have the CloudSolrClient with SolrJ running now, but it seems a bit slower compared to ConcurrentUpdateSolrClient. This was not expected. The logs show that CloudSolrClient sends the docs only to the leaders. So the only advantage of CloudSolrClient is that it is "cloud aware"?

With ConcurrentUpdateSolrClient I get about 1600 docs/sec for loading. With CloudSolrClient I get only about 1200 docs/sec.

The system monitoring shows that with CloudSolrClient all nodes and cores are under heavy load. I thought that only the leaders would be under load until a commit, and would then replicate to the other replicas, so that the replicas which are not leaders have capacity to answer search requests. I think I still don't get the advantage of CloudSolrClient?

Regards,
Bernd

On 09.05.2018 at 19:15, Erick Erickson wrote:
> You may not need to deal with any of this. The default CloudSolrClient
> call creates a new LBHttpSolrClient for you. So unless you're doing
> something custom with any LBHttpSolrClient you create, you don't need
> to create one yourself.
>
> Second, the default for CloudSolrClient.add() is to split the list of
> documents you provide into sub-lists that consist of the docs destined
> for a particular shard, and send those to the leader. Does the default
> not work for you?
>
> Best,
> Erick
>
> On Wed, May 9, 2018 at 2:54 AM, Bernd Fehling wrote:
>> Hi list,
>> while going from single core master/slave to cloud multi core/node
>> with leader/replica I want to change my SolrJ loading, because
>> ConcurrentUpdateSolrClient isn't cloud aware and has performance
>> impacts. I want to use CloudSolrClient with LBHttpSolrClient, and
>> updates should only go to shard leaders.
>> Question: what is the difference between sendUpdatesOnlyToShardLeaders
>> and sendDirectUpdatesToShardLeadersOnly?
>> Regards,
>> Bernd
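What Erick describes - CloudSolrClient.add() splitting a batch into per-shard sub-lists - can be sketched as follows. Note this is only an illustration: SolrCloud's compositeId router actually uses MurmurHash3 over the document id, while this sketch substitutes md5 just to show the grouping idea:

```python
import hashlib


def route(doc_id, num_shards):
    """Map a document id to a shard index.

    Stand-in for Solr's hash-range routing: the real router uses
    MurmurHash3 and per-shard hash ranges, not md5 and modulo.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def group_by_shard(docs, num_shards):
    """What a cloud-aware client does before sending: split one update
    into sub-lists, one per shard leader, so each leader receives only
    the documents it owns."""
    batches = {shard: [] for shard in range(num_shards)}
    for doc in docs:
        batches[route(doc["id"], num_shards)].append(doc)
    return batches
```

This routing is why a cloud-aware client saves one hop per document (no forwarding from a non-leader node), even though the leaders still replicate each update to their replicas, which explains the load Bernd sees on all nodes.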