Ok, thanks for this information!

On 05/06/2015 17:37, Erick Erickson wrote:
Picking up on Alessandro's point. While you can post all these docs
and commit at the end, unless you do a hard commit (openSearcher=true
or false doesn't matter), if your server terminates abnormally for
_any_ reason all these docs will be replayed on startup from the
transaction log.

I'll also echo Alessandro's point that I don't see the advantage of this.
Personally I'd set my hard commit interval with openSearcher=false
to something like 60000 (60 seconds; it's in milliseconds) and forget
about it. You're not imposing much extra load on the system, you're
durably saving your progress, and you're avoiding really, really, really
long restarts if your server should stop for some reason.

If you don't want the docs to be _visible_ for searches, be sure your
autocommit has openSearcher set to false and disable soft commits
(set the interval to -1 or remove it from your solrconfig), as in the
sketch below.
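
A minimal sketch of that in solrconfig.xml (standard <updateHandler>
settings; 60000 ms is the interval suggested above):

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- Hard commit every 60 seconds: durably flushes segments and
           truncates the tlog, but opens no searcher, so nothing new
           becomes visible. -->
      <autoCommit>
        <maxTime>60000</maxTime>
        <openSearcher>false</openSearcher>
      </autoCommit>
      <!-- -1 disables soft commits entirely, so no new searchers are
           opened until you commit explicitly. -->
      <autoSoftCommit>
        <maxTime>-1</maxTime>
      </autoSoftCommit>
    </updateHandler>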

Best,
Erick

On Fri, Jun 5, 2015 at 8:21 AM, Alessandro Benedetti
<benedetti.ale...@gmail.com> wrote:
I can't see any problem with that, but while we're talking about commits
I'd like to distinguish between "hard" and "soft".

Hard commit -> durability
Soft commit -> visibility

I suggest this interesting read:
https://lucidworks.com/blog/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/
It's an old but interesting post by Erick.

It explains the differences between the commit types in more detail.

I would place you in this scenario:

Heavy (bulk) indexing
The assumption here is that you're interested in getting lots of data into
the index as quickly as possible, for search sometime in the future. I'm
thinking of initial loads of a data source, etc.

    - Set your soft commit interval quite long, as in 10 minutes or even
    longer (-1 for no soft commits at all). *Soft commit is about
    visibility*, and my assumption here is that bulk indexing isn't about
    near-real-time searching, so don't do the extra work of opening any
    kind of searcher.
    - Set your hard commit interval to 15 seconds, openSearcher=false.
    Again, the assumption is that you're going to be just blasting data at
    Solr. The worst case here is that you restart your system and have to
    replay 15 seconds or so of data from your tlog. If your system is
    bouncing up and down more often than that, fix the reason for that
    first. (A config sketch for these settings follows this list.)
    - Only after you've tried the simple things should you consider
    refinements; they're usually only required in unusual circumstances,
    but they include:
       - Turning off the tlog completely for the bulk-load operation
       - Indexing offline with some kind of map-reduce process
       - Only having a leader per shard, no replicas, for the load, then
       turning on replicas later and letting them do old-style replication
       to catch up. Note that this is automatic: if a node discovers it is
       “too far” out of sync with the leader, it initiates an old-style
       replication. After it has caught up, it'll get documents as they're
       indexed to the leader and keep its own tlog.
       - etc.
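
As a sketch, the settings from the list above map to this solrconfig.xml
(15-second hard commit, soft commits off; the commented-out <updateLog>
is the stock element you would remove only for the "turn off the tlog"
refinement, re-enabling it after the bulk load):

    <updateHandler class="solr.DirectUpdateHandler2">
      <autoCommit>
        <maxTime>15000</maxTime>           <!-- hard commit every 15 s -->
        <openSearcher>false</openSearcher> <!-- durability, not visibility -->
      </autoCommit>
      <autoSoftCommit>
        <maxTime>-1</maxTime>              <!-- no soft commits during load -->
      </autoSoftCommit>
      <!-- Refinement only: commenting out the updateLog disables the
           transaction log for the bulk load.
      <updateLog>
        <str name="dir">${solr.ulog.dir:}</str>
      </updateLog>
      -->
    </updateHandler>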


You could indeed do the commit only at the end, but I can't see any
advantage in that.
I suggest you play with the auto hard/soft commit config to get a better
idea of the situation!
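
If you do go with a single commit at the end, it's one request to the
update handler (assuming a default local install and the core name
"mydb" from your bin/post commands):

    curl 'http://localhost:8983/solr/mydb/update?commit=true'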

Cheers

2015-06-05 16:08 GMT+01:00 Bruno Mannina <bmann...@free.fr>:

Hi Alessandro,

I'm actually on my dev computer, so I would like to post 1,000,000 XML
files (with a structure defined in my schema.xml).

I have already imported 1,000,000 XML files using
bin/post -c mydb /DATA0/1 /DATA0/2 /DATA0/3 /DATA0/4 /DATA0/5
where /DATA0/X contains 20,000 XML files (I ran this repeatedly, just
changing X from 1 to 50).

I would now like to do
bin/post -c mydb /DATA1

I would like to know if my Solr 5 will run fine and not produce a memory
error, given that there are so many files in one post without a commit.

The commit will be done at the end of the 1,000,000.

Is that OK?
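
Concretely, I am thinking of something like this, one run with a single
commit at the end (assuming /DATA1 is laid out like /DATA0 and that
bin/post's -commit yes|no option, default yes, is available):

    # Post every sub-directory without the automatic trailing commit...
    for X in $(seq 1 50); do
        bin/post -c mydb -commit no /DATA1/$X
    done
    # ...then issue one explicit hard commit at the very end.
    curl 'http://localhost:8983/solr/mydb/update?commit=true'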



On 05/06/2015 16:59, Alessandro Benedetti wrote:

Hi Bruno,
I can't see what your challenge is.
Of course you can index your data whichever way you want and do a commit
whenever you want…
Are those XML files in Solr's XML update format?
If not, you would need to use the DIH, the extracting request handler, or
a custom indexer application.
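
For reference, Solr's native XML update format looks like this (the field
names here are just examples; yours would come from your schema.xml):

    <add>
      <doc>
        <field name="id">doc-1</field>
        <field name="title">An example document</field>
      </doc>
    </add>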
Maybe I missed your point…
Give me more details, please!

Cheers

2015-06-05 15:41 GMT+01:00 Bruno Mannina <bmann...@free.fr>:

  Dear Solr Users,
I would like to post 1,000,000 records (1 record = 1 file) in one shot,
and do the commit at the end.

Is it possible to do that?

I have several directories, each with 20,000 files inside.
I would like to do:
bin/post -c mydb /DATA

under DATA I have
/DATA/1/*.xml (20,000 files)
/DATA/2/*.xml (20,000 files)
/DATA/3/*.xml (20,000 files)
....
/DATA/50/*.xml (20,000 files)

Currently, I post 5 directories at a time (it takes around 1h30 for
100,000 records/files).

But it's Friday and I would like to let it run over the weekend
unattended.

Thanks for your comment,

Bruno




--
--------------------------

Benedetti Alessandro
Visiting card : http://about.me/alessandro_benedetti

"Tyger, tyger burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?"

William Blake - Songs of Experience -1794 England


