Re: Multi threading indexing

2018-05-14 Thread Mikhail Khludnev
A few years ago I contributed a server-side concurrency "booster",
https://issues.apache.org/jira/browse/SOLR-3585.
These days, though, I'd consider this a client-side (or ETL) responsibility.

On Mon, May 14, 2018 at 6:39 AM, Raymond Xie  wrote:

> Hello,
>
> I have a huge amount of data (TB scale) to index. Can anyone share ideas or
> code for doing the indexing with multiple threads?
>
> Sincerely yours,
>
> Raymond
>



-- 
Sincerely yours
Mikhail Khludnev
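
For reference, here is a minimal client-side sketch of the kind of multithreaded
indexing discussed above. It assumes a hypothetical SolrCloud collection
"mycollection" reachable through ZooKeeper at localhost:9983; the thread count,
batch size, and field names are illustrative only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ParallelIndexer {
  public static void main(String[] args) throws Exception {
    // Hypothetical ZooKeeper address and collection name -- adjust to your cluster.
    CloudSolrClient client = new CloudSolrClient.Builder()
        .withZkHost("localhost:9983")
        .build();
    client.setDefaultCollection("mycollection");

    // CloudSolrClient is thread safe, so several indexing threads can share it.
    ExecutorService pool = Executors.newFixedThreadPool(8);
    for (int t = 0; t < 8; t++) {
      final int threadId = t;
      pool.submit(() -> {
        List<SolrInputDocument> batch = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", threadId + "-" + i);
          doc.addField("text_t", "document body " + i);
          batch.add(doc);
          if (batch.size() == 1000) {   // send batches, not single documents
            client.add(batch);
            batch.clear();
          }
        }
        if (!batch.isEmpty()) {
          client.add(batch);
        }
        return null;
      });
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
    client.commit();                    // one explicit commit at the end
    client.close();
  }
}
```

Batching and a single commit at the end matter at least as much as the thread
count; at TB scale the ETL layer would typically also partition the source data
so each thread reads its own slice.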


Re: How to restart solr in docker?

2018-05-14 Thread reznov9185
This is what I needed to do to get updated solrconfig files from my local
machine into the docker container:
`sudo docker cp docker/solr/production/conf/solrconfig.xml
solr:/opt/solr/server/solr/production/conf/solrconfig.xml`
`sudo docker restart solr`
For some reason the changes were not syncing automatically, so I had to cp the
changed configs and restart.




--
Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html


Techniques for Retrieving Hits

2018-05-14 Thread Terry Steichen
In order to allow users to retrieve the documents that match a query, I
make use of the embedded Jetty container to provide file server
functionality.  To make this happen, I provide a symbolic link between
the actual document archive, and the Jetty file server.  This seems
somewhat of a kludge, and I'm wondering if others have a better way to
retrieve the desired documents?  (I'm not too concerned about security
because I use ssh port forwarding to connect to remote authenticated
clients.)



Re: Async exceptions during distributed update

2018-05-14 Thread Jay Potharaju
Adding some more context to my last email:
Solr: 6.6.3
2 nodes: 3 shards each
No replication.
Can someone answer the following questions?
1) Any ideas on why the following errors keep happening? AFAIK the StreamingSolrClients
error is because of timeouts when connecting to other nodes.
The async errors are also network related, as Emir explained earlier in this thread.
There were no network issues, but the error has come back and is filling up my logs.
2) Is anyone using Solr 6.6.3 in production, and what has their experience been
so far?
3) Is there any good documentation or blog post that explains the inner
workings of SolrCloud networking?

Thanks
Jay
ERROR org.apache.solr.update.StreamingSolrClients
org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException:
Async exception during ...

> On May 13, 2018, at 9:21 PM, Jay Potharaju  wrote:
> 
> Hi,
> I restarted both my solr servers but I am seeing the async error again. In
> the older 5.x versions of SolrCloud, Solr would normally recover gracefully in
> case of network errors, but Solr 6.6.3 does not seem to be doing that. At this
> time only a small percentage of my operations are deleteByQuery; it is mostly
> indexing of documents.
> I have not noticed any network blip like last time.  Any suggestions, or is
> anyone else also having the same issue on Solr 6.6.3?
> 
> I am again seeing the following two errors back to back.
> 
>  ERROR org.apache.solr.update.StreamingSolrClients  
>  
> org.apache.solr.update.processor.DistributedUpdateProcessor$DistributedUpdatesAsyncException:
>  Async exception during distributed update: Read timed out
> Thanks
> Jay 
>  
> 
> 
>> On Wed, May 9, 2018 at 12:34 AM Emir Arnautović 
>>  wrote:
>> Hi Jay,
>> Network blip might be the cause, but also the consequence of this issue.
>> Maybe you can try avoiding DBQ while indexing and see if it is the cause.
>> You can do a thread dump on “the other” node and see if there are blocked
>> threads; that can give you more clues about what’s going on.
>> 
>> Thanks,
>> Emir
>> --
>> Monitoring - Log Management - Alerting - Anomaly Detection
>> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> 
>> 
>> 
>> > On 8 May 2018, at 17:53, Jay Potharaju  wrote:
>> > 
>> > Hi Emir,
>> > I was seeing this error as long as the indexing was running. Once I stopped
>> > the indexing the errors also stopped.  Yes, we do monitor both hosts & solr
>> > but have not seen anything out of the ordinary except for a small network
>> > blip. In my experience solr generally recovers after a network blip and
>> > there are a few errors for streaming solr client...but have never seen this
>> > error before.
>> > 
>> > Thanks
>> > Jay
>> > 
>> > Thanks
>> > Jay Potharaju
>> > 
>> > 
>> > On Tue, May 8, 2018 at 12:56 AM, Emir Arnautović <
>> > emir.arnauto...@sematext.com> wrote:
>> > 
>> >> Hi Jay,
>> >> This is a low ingestion rate. What is the size of your index? What is the
>> >> heap size? I am guessing that this is not a huge index, so I am leaning
>> >> toward what Shawn mentioned - some combination of DBQ/merge/commit/optimise
>> >> that is blocking indexing. Though, it is strange that it is happening only
>> >> on one node if you are sending updates randomly to both nodes. Do you
>> >> monitor your hosts/Solr? Do you see anything different at the time when the
>> >> timeouts happen?
>> >> 
>> >> Thanks,
>> >> Emir
>> >> --
>> >> Monitoring - Log Management - Alerting - Anomaly Detection
>> >> Solr & Elasticsearch Consulting Support Training - http://sematext.com/
>> >> 
>> >> 
>> >> 
>> >>> On 8 May 2018, at 03:23, Jay Potharaju  wrote:
>> >>> 
>> >>> I have about 3-5 updates per second.
>> >>> 
>> >>> 
>>  On May 7, 2018, at 5:02 PM, Shawn Heisey  wrote:
>>  
>> > On 5/7/2018 5:05 PM, Jay Potharaju wrote:
>> > There are some deletes by query. I have not had any issues with DBQ,
>> > currently have 5.3 running in production.
>>  
>>  Here's the big problem with DBQ.  Imagine this sequence of events with
>>  these timestamps:
>>  
>>  13:00:00: A commit for change visibility happens.
>>  13:00:00: A segment merge is triggered by the commit.
>>  (It's a big merge that takes exactly 3 minutes.)
>>  13:00:05: A deleteByQuery is sent.
>>  13:00:15: An update to the index is sent.
>>  13:00:25: An update to the index is sent.
>>  13:00:35: An update to the index is sent.
>>  13:00:45: An update to the index is sent.
>>  13:00:55: An update to the index is sent.
>>  13:01:05: An update to the index is sent.
>>  13:01:15: An update to the index is sent.
>>  13:01:25: An update to the index is sent.
>>  {time passes, more updates might be sent}
>>  13:03:00: The merge finishes.
>>  
>>  Here's what would happen in this scenario:  The DBQ and all of the
>>  update requests sent *after* the DBQ will block until the merge
>>  finishes.  
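
One workaround that is often suggested for this DBQ blocking (sketched here
under assumed names, not taken from this thread) is to resolve the query to
ids on the client and delete by id, since delete-by-id does not have to wait
on in-flight merges the way deleteByQuery does. The URL, collection, query,
and field names below are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class DeleteByIdInstead {
  public static void main(String[] args) throws Exception {
    // Hypothetical Solr URL and collection -- adjust to your setup.
    try (HttpSolrClient client = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/mycollection").build()) {

      // Resolve the ids that the deleteByQuery would have matched.
      // For large result sets, page through with cursorMark instead.
      SolrQuery query = new SolrQuery("status_s:expired");   // hypothetical query
      query.setFields("id");
      query.setRows(1000);

      List<String> ids = new ArrayList<>();
      QueryResponse rsp = client.query(query);
      for (SolrDocument doc : rsp.getResults()) {
        ids.add((String) doc.getFieldValue("id"));
      }

      // Delete by id; visibility of the deletes is handled by the normal
      // (auto)commit policy.
      if (!ids.isEmpty()) {
        client.deleteById(ids);
      }
    }
  }
}
```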

Re: Techniques for Retrieving Hits

2018-05-14 Thread Shawn Heisey

On 5/14/2018 6:46 AM, Terry Steichen wrote:

In order to allow users to retrieve the documents that match a query, I
make use of the embedded Jetty container to provide file server
functionality.  To make this happen, I provide a symbolic link between
the actual document archive, and the Jetty file server.  This seems
somewhat of a kludge, and I'm wondering if others have a better way to
retrieve the desired documents?  (I'm not too concerned about security
because I use ssh port forwarding to connect to remote authenticated
clients.)


This is not a recommended usage for the servlet container where Solr runs.

Solr is a search engine.  It is not designed to be a data store, 
although some people do use it that way.


If systems running Solr clients want to access all the information for a 
document when the search results do not contain all the information, 
they should use what IS in the search results to access that data from 
the system where it is stored -- that could be a database, a file 
server, a webserver, or similar.


Thanks,
Shawn



Commit too slow?

2018-05-14 Thread LOPEZ-CORTES Mariano-ext
Hi

After injecting 200 documents into our Solr server, the commit operation at the
end of the process (using ConcurrentUpdateSolrClient) takes 10 minutes. Is that
too slow?

Our auto-commit policy is the following:

 <autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
 </autoCommit>
 <autoSoftCommit>
  <maxTime>15000</maxTime>
 </autoSoftCommit>
Thanks !



Re: Commit too slow?

2018-05-14 Thread Shawn Heisey
On 5/14/2018 11:29 AM, LOPEZ-CORTES Mariano-ext wrote:
> After injecting 200 documents into our Solr server, the commit
> operation at the end of the process (using ConcurrentUpdateSolrClient) takes
> 10 minutes. Is that too slow?

There is a wiki page discussing slow commits:

https://wiki.apache.org/solr/SolrPerformanceProblems#Slow_commits

Thanks,
Shawn



Re: Techniques for Retrieving Hits

2018-05-14 Thread Terry Steichen
Shawn,

As noted in my embedded comments below, I don't really see the problem
you apparently do. 

Maybe I'm missing something important (which certainly wouldn't  be the
first - or last -  time that happened).

I posted this note because I've not seen list comments pertaining to the
job of actually locating and retrieving hitlist documents. 

My way "seems" to work, and it is quite simple and compact.  I just
threw it out seeking a sanity check from others.

Terry


On 05/14/2018 11:32 AM, Shawn Heisey wrote:
> On 5/14/2018 6:46 AM, Terry Steichen wrote:
>> In order to allow users to retrieve the documents that match a query, I
>> make use of the embedded Jetty container to provide file server
>> functionality.  To make this happen, I provide a symbolic link between
>> the actual document archive, and the Jetty file server.  This seems
>> somewhat of a kludge, and I'm wondering if others have a better way to
>> retrieve the desired documents?  (I'm not too concerned about security
>> because I use ssh port forwarding to connect to remote authenticated
>> clients.)
>
> This is not a recommended usage for the servlet container where Solr
> runs.
But if the retrieval traffic is light, what's the problem?
>
> Solr is a search engine.  It is not designed to be a data store,
> although some people do use it that way.
Perhaps I didn't explain it right, but I'm not using it as a datastore
(other than the fact that I keep the actual file repository on the same
machine on which Solr runs).  I've got plenty of storage, so that's not
an issue, and, as I mentioned above, traffic is quite light.
>
> If systems running Solr clients want to access all the information for
> a document when the search results do not contain all the information,
> they should use what IS in the search results to access that data from
> the system where it is stored -- that could be a database, a file
> server, a webserver, or similar.
Perhaps I'm missing something, but search results cannot "contain all
the information," can they?  I use highlighting, but that's just showing a
few snippets - not a substitute for the document itself.
>
> Thanks,
> Shawn
>
>



Re: Techniques for Retrieving Hits

2018-05-14 Thread Shawn Heisey
On 5/14/2018 3:13 PM, Terry Steichen wrote:
> I posted this note because I've not seen list comments pertaining to the
> job of actually locating and retrieving hitlist documents.

How documents are retrieved will be highly dependent on your setup. 
Here's how things usually go:

If the original data came from a database, then the system where people
do their searches should know how to talk to the database, and use
information in the search results to look up the full original document
in the database.

If the source data is on a file server, then the system where people do
their searches will need to have the file server storage mounted.  It
will then use information in the search results to access the full
original document.

Ditto for any other kind of canonical data store with Solr as the search
engine.
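
As a sketch of that pattern (the field names, paths, and URL below are
hypothetical, not anything from Terry's setup): the index stores a pointer to
the original file, and the search application resolves that pointer against
the mounted archive instead of serving files out of Solr's Jetty.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

public class FetchOriginals {
  public static void main(String[] args) throws Exception {
    // Hypothetical Solr URL, collection, and field names -- adjust to your schema.
    try (HttpSolrClient solr = new HttpSolrClient.Builder(
        "http://localhost:8983/solr/documents").build()) {

      SolrQuery q = new SolrQuery("contract AND penalty");
      q.setFields("id", "file_path_s");   // the index stores a pointer, not the file

      for (SolrDocument hit : solr.query(q).getResults()) {
        // The application, not Solr, resolves the pointer against the archive
        // mounted at a hypothetical /mnt/archive.
        Path original = Paths.get("/mnt/archive",
            (String) hit.getFieldValue("file_path_s"));
        byte[] bytes = Files.readAllBytes(original);
        System.out.println(hit.getFieldValue("id") + " -> " + bytes.length + " bytes");
      }
    }
  }
}
```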

The system where searches are done will be implemented by you.  It will
be up to that system to handle any kind of security filtering for both
Solr searches and document access.

Solr should not be exposed directly to end users.  Most of the time,
what's in Solr is not particularly sensitive ... but when Solr is
exposed to people who cannot be trusted, those end users may be able to
change or delete any data in Solr.  They might also be able to send
denial of service queries directly to Solr.

Thanks,
Shawn



[ANNOUNCE] Apache Solr 7.3.1 released

2018-05-14 Thread Cao Mạnh Đạt
15 May 2018, Apache Solr™ 7.3.1 available

The Lucene PMC is pleased to announce the release of Apache Solr 7.3.1

Solr is the popular, blazing fast, open source NoSQL search platform from
the Apache Lucene project. Its major features include powerful full-text
search, hit highlighting, faceted search and analytics, rich document
parsing, geospatial search, extensive REST APIs as well as parallel SQL.
Solr is enterprise grade, secure and highly scalable, providing fault
tolerant distributed search and indexing, and powers the search and
navigation features of many of the world's largest internet sites.

This release includes 9 bug fixes since the 7.3.0 release. Some of the
major fixes are:

* Deleting replicas sometimes fails and causes the replicas to exist in the
down state
* Upgrade commons-fileupload dependency to 1.3.3 to address CVE-2016-131
* Do not allow to use absolute URIs for including other files in
solrconfig.xml and schema parsing
* A successful restore collection should mark the shard state as active and
not buffering

Furthermore, this release includes Apache Lucene 7.3.1, which includes 1 bug
fix since the 7.3.0 release.

The release is available for immediate download at:

http://www.apache.org/dyn/closer.lua/lucene/solr/7.3.1

Please read CHANGES.txt for a detailed list of changes:

https://lucene.apache.org/solr/7_3_1/changes/Changes.html

Please report any feedback to the mailing lists (
http://lucene.apache.org/solr/discussion.html)

Note: The Apache Software Foundation uses an extensive mirroring network
for distributing releases. It is possible that the mirror you are using may
not have replicated the release yet. If that is the case, please try
another mirror. This also goes for Maven access.


Re: question about updates to shard leaders only

2018-05-14 Thread Bernd Fehling

OK, I now have CloudSolrClient running with SolrJ, but it seems
a bit slower compared to ConcurrentUpdateSolrClient.
This was not expected.
The logs show that CloudSolrClient sends the docs only to the leaders.

So the only advantage of CloudSolrClient is that it is "cloud aware"?

With ConcurrentUpdateSolrClient I get about 1600 docs/sec for loading.
With CloudSolrClient I get only about 1200 docs/sec.

The system monitoring shows that with CloudSolrClient all nodes and cores
are under heavy load. I thought that only the leaders would be under load
until a commit and would then replicate to the other replicas,
and that the replicas which are not leaders would have capacity to answer
search requests.

I think I still don't see the advantage of CloudSolrClient.

Regards,
Bernd



Am 09.05.2018 um 19:15 schrieb Erick Erickson:

You may not need to deal with any of this.

The default CloudSolrClient call creates a new LBHttpSolrClient for
you. So unless you're doing something custom with any LBHttpSolrClient
you create, you don't need to create one yourself.

Second, the default for CloudSolrClient.add() is to split the list of
documents you provide into sub-lists that consist of the docs destined
for a particular shard and send those to the leaders.

Does the default not work for you?

Best,
Erick

On Wed, May 9, 2018 at 2:54 AM, Bernd Fehling
 wrote:

Hi list,

while going from a single-core master/slave setup to a cloud multi core/node
setup with leader/replica, I want to change my SolrJ loading, because
ConcurrentUpdateSolrClient isn't cloud aware and has performance
impacts.
I want to use CloudSolrClient with LBHttpSolrClient, and updates
should only go to shard leaders.

Question, what is the difference between sendUpdatesOnlyToShardLeaders
and sendDirectUpdatesToShardLeadersOnly?

Regards,
Bernd
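
For what it's worth, here is a sketch of how the two options are set on the
CloudSolrClient builder, as I read the SolrJ javadocs:
sendUpdatesOnlyToShardLeaders() restricts the load-balanced update targets to
the shard leaders, while sendDirectUpdatesToShardLeadersOnly() makes id-routed
updates go only to the leader of the target shard and fail if that leader is
down, instead of falling back to another replica. The ZooKeeper address and
collection name are placeholders.

```java
import org.apache.solr.client.solrj.impl.CloudSolrClient;

public class LeaderRoutingClients {
  public static void main(String[] args) throws Exception {
    // sendUpdatesOnlyToShardLeaders(): updates are load balanced across shard
    // leaders only, rather than across every replica.
    CloudSolrClient leadersPreferred = new CloudSolrClient.Builder()
        .withZkHost("localhost:9983")
        .sendUpdatesOnlyToShardLeaders()
        .build();
    leadersPreferred.setDefaultCollection("mycollection");

    // sendDirectUpdatesToShardLeadersOnly(): updates routed directly by
    // document id go only to the leader of the target shard and fail fast if
    // that leader is unavailable.
    CloudSolrClient leadersOnly = new CloudSolrClient.Builder()
        .withZkHost("localhost:9983")
        .sendDirectUpdatesToShardLeadersOnly()
        .build();
    leadersOnly.setDefaultCollection("mycollection");

    leadersPreferred.close();
    leadersOnly.close();
  }
}
```

Either way, the replicas still do work on every update: the leader forwards each
document to its replicas for indexing, which is consistent with the heavy load
Bernd sees on all nodes.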