Picking up on Alessandro's point: while you can post all these docs and commit at the end, unless you do a hard commit (openSearcher=true or false doesn't matter), then if your server should terminate abnormally for _any_ reason, all these docs will be replayed from the transaction log on startup.
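For reference, hard and soft autocommit both live in the <updateHandler> section of solrconfig.xml. A minimal sketch of the settings discussed in this thread (the interval values are just the ones mentioned here, not universal recommendations):

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Durability: hard commit every 60 seconds, without opening a new searcher -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- Visibility: -1 disables soft commits entirely, so nothing becomes
       searchable until an explicit commit is issued -->
  <autoSoftCommit>
    <maxTime>-1</maxTime>
  </autoSoftCommit>
</updateHandler>
```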
I'll also echo Alessandro's point that I don't see the advantage of this. Personally I'd set my hard commit interval, with openSearcher=false, to something like 60000 (60 seconds; it's in milliseconds) and forget about it. You're not imposing much extra load on the system, you're durably saving your progress, and you're avoiding really, really, really long restarts if your server should stop for some reason. If you don't want the docs to be _visible_ to searches, be sure your autocommit has openSearcher set to false and disable soft commits (set the interval to -1 or remove it from your solrconfig).

Best,
Erick

On Fri, Jun 5, 2015 at 8:21 AM, Alessandro Benedetti
<benedetti.ale...@gmail.com> wrote:
> I can't see any problem with that, but since we're talking about commits I'd
> like to distinguish between "hard" and "soft":
>
> Hard commit -> durability
> Soft commit -> visibility
>
> I suggest this interesting reading:
> https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
> It's an old post of Erick's.
>
> It explains the differences between the commit types in more detail.
>
> I would put you in this scenario:
>
> Heavy (bulk) indexing
>>
>> The assumption here is that you're interested in getting lots of data to
>> the index as quickly as possible for search sometime in the future. I'm
>> thinking original loads of a data source etc.
>>
>> - Set your soft commit interval quite long. As in 10 minutes or even
>> longer (-1 for no soft commits at all). *Soft commit is about
>> visibility,* and my assumption here is that bulk indexing isn't about
>> near-real-time searching, so don't do the extra work of opening any kind of
>> searcher.
>> - Set your hard commit interval to 15 seconds, openSearcher=false.
>> Again the assumption is that you're going to be just blasting data at
>> Solr.
>> The worst case here is that you restart your system and have to replay 15
>> seconds or so of data from your tlog.
>> If your system is bouncing up and
>> down more often than that, fix the reason for that first.
>> - Only after you've tried the simple things should you consider
>> refinements; they're usually only required in unusual circumstances. They
>> include:
>> - Turning off the tlog completely for the bulk-load operation
>> - Indexing offline with some kind of map-reduce process
>> - Only having a leader per shard, no replicas, for the load, then
>> turning on replicas later and letting them do old-style replication to
>> catch up. Note that this is automatic: if a node discovers it is "too
>> far" out of sync with the leader, it initiates an old-style
>> replication.
>> After it has caught up, it'll get documents as they're indexed to the
>> leader and keep its own tlog.
>> - etc.
>>
>
> You could actually do the commit only at the end, but I can't see any
> advantage in that.
> I suggest you play with the auto hard/soft commit config to get a better
> idea of the situation!
>
> Cheers
>
> 2015-06-05 16:08 GMT+01:00 Bruno Mannina <bmann...@free.fr>:
>
>> Hi Alessandro,
>>
>> I'm currently on my dev computer, and I would like to post 1 000 000 xml
>> files (with a structure defined in my schema.xml).
>>
>> I have already imported 1 000 000 xml files by using
>> bin/post -c mydb /DATA0/1 /DATA0/2 /DATA0/3 /DATA0/4 /DATA0/5
>> where /DATA0/X contains 20 000 xml files (I do it 20 times, just
>> changing X from 1 to 50).
>>
>> I would now like to do
>> bin/post -c mydb /DATA1
>>
>> I would like to know if my SOLR5 will run fine and not hit a memory
>> error because there are too many files
>> in one post without doing a commit.
>>
>> The commit will be done at the end of the 1 000 000.
>>
>> Is that ok?
>>
>>
>> On 05/06/2015 16:59, Alessandro Benedetti wrote:
>>
>>> Hi Bruno,
>>> I can't see what your challenge is.
>>> Of course you can index your data in whatever flavour you want and do a commit
>>> whenever you want…
>>> Are those xml files Solr xml?
>>> If not, you would need to use the DIH, the extract update handler, or any
>>> custom indexer application.
>>> Maybe I missed your point…
>>> Give me more details please!
>>>
>>> Cheers
>>>
>>> 2015-06-05 15:41 GMT+01:00 Bruno Mannina <bmann...@free.fr>:
>>>
>>>> Dear Solr Users,
>>>>
>>>> I would like to post 1 000 000 records (1 record = 1 file) in one
>>>> shot,
>>>> and do the commit at the end.
>>>>
>>>> Is it possible to do that?
>>>>
>>>> I have several directories, each with 20 000 files inside.
>>>> I would like to do:
>>>> bin/post -c mydb /DATA
>>>>
>>>> Under DATA I have
>>>> /DATA/1/*.xml (20 000 files)
>>>> /DATA/2/*.xml (20 000 files)
>>>> /DATA/3/*.xml (20 000 files)
>>>> ....
>>>> /DATA/50/*.xml (20 000 files)
>>>>
>>>> Currently, I post 5 directories at a time (it takes around 1h30 for 100
>>>> 000 records/files).
>>>>
>>>> But it's Friday and I would like to let it run over the weekend on its own.
>>>>
>>>> Thanks for your comments,
>>>>
>>>> Bruno
>>>>
>>>> ---
>>>> This e-mail contains no viruses or malware because avast! Antivirus
>>>> protection is active.
>>>> https://www.avast.com/antivirus
>>>>
>>>>
>>>
>>
>
>
> --
> --------------------------
>
> Benedetti Alessandro
> Visiting card : http://about.me/alessandro_benedetti
>
> "Tyger, tyger burning bright
> In the forests of the night,
> What immortal hand or eye
> Could frame thy fearful symmetry?"
>
> William Blake - Songs of Experience -1794 England
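Bruno's weekend bulk load, combined with the commit-at-the-end approach discussed in this thread, could be driven by a small script. This is a sketch only: it assumes Solr 5 running on localhost:8983 with a core named mydb and the /DATA layout described above, and it assumes autoCommit has openSearcher=false with soft commits disabled so nothing is visible until the final commit. Note that bin/post itself commits after each invocation by default; check bin/post -h for the option to suppress that if a single commit at the very end is truly wanted.

```shell
#!/bin/sh
# Post each of the 50 directories of xml files in turn.
for i in $(seq 1 50); do
    bin/post -c mydb "/DATA/$i"
done

# Explicit commit at the end; this opens a searcher, making
# all the newly indexed documents visible at once.
curl 'http://localhost:8983/solr/mydb/update?commit=true'
```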