You haven't really told us _how_ you are indexing, so I'm going to
make some comments that may be irrelevant...

At 600M documents, you'll almost certainly have to shard your index.
It sounds like you're doing the sharding yourself in Lucene by having
different Lucene indexes based on date. As you indicate, this makes it
more difficult for users. SolrCloud handles this for you and is very
likely preferable.
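
If you do go to SolrCloud, you create the sharded collection once, up
front, either through the Collections API over HTTP or from SolrJ.
Here's a rough sketch of the SolrJ version; the ZooKeeper address,
collection name, config name and the 10x2 shard/replica layout are
all placeholders, so prototype to find numbers that fit your data and
hardware:

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.CollectionAdminRequest;

public class CreateCollection {
  public static void main(String[] args) throws Exception {
    // ZooKeeper ensemble address is a placeholder; use your own.
    try (CloudSolrClient client =
             new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181")) {
      // 10 shards x 2 replicas is purely illustrative.
      CollectionAdminRequest
          .createCollection("mycollection", "myconfig", 10, 2)
          .process(client);
    }
  }
}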

bq: I know that SolrCloud can solve the search problem when the index
data is big, but it's even slower in indexing than Solr.

This is not necessarily true. In fact, SolrCloud should be much faster
_if_ you use SolrJ with CloudSolrClient to index to Solr. Under the
covers, Solr routes documents to the correct
shard based on a hash of the <uniqueKey>. CloudSolrClient sends the
documents to the correct shard automatically, so if you have 10 shards
and index a batch of, say, 1,000 documents, 10 groups of 100 docs will
be sent out in parallel, one to each shard.
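
To make that concrete, here's a minimal sketch of batched indexing
through CloudSolrClient. The ZooKeeper address, collection name and
field names are made up; the point is to batch the adds and let
CloudSolrClient do the routing:

import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class CloudIndexer {
  public static void main(String[] args) throws Exception {
    // ZooKeeper address and collection name are placeholders.
    try (CloudSolrClient client =
             new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181")) {
      client.setDefaultCollection("mycollection");
      List<SolrInputDocument> docs = new ArrayList<>();
      for (int i = 0; i < 100000; i++) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", Integer.toString(i)); // the <uniqueKey> field
        doc.addField("title_t", "document " + i); // made-up field
        docs.add(doc);
        if (docs.size() == 1000) { // with 10 shards, ~100 docs go to
          client.add(docs);        // each shard leader in parallel
          docs.clear();
        }
      }
      if (!docs.isEmpty()) client.add(docs);
      client.commit(); // in production, prefer autoCommit settings
    }
  }
}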

Here's an example of a SolrJ program (just a Java program using Solr
libraries): https://lucidworks.com/2012/02/14/indexing-with-solrj/.
Note that this code is rather old: it uses StreamingUpdateSolrServer,
where you should now use CloudSolrClient. It also processes structured
documents using Tika, but you can remove those bits of the code.

One technique I use with SolrJ is to comment out the single line that
sends the docs to Solr (the server.add(docs) call in the example). This
tells me whether the bottleneck is in getting the data from the
database or indexing it to Solr. Often the bottleneck is getting the
data, but with 600M documents that may not be the case.
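
The test itself is trivial; here's the shape of it (the batch source
is hypothetical, standing in for whatever JDBC code you use to pull
rows from Oracle):

import java.util.Iterator;
import java.util.List;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

public class TimingTest {
  // 'batches' stands in for your JDBC fetch loop.
  static void run(SolrClient client, Iterator<List<SolrInputDocument>> batches)
      throws Exception {
    long start = System.currentTimeMillis();
    while (batches.hasNext()) {
      List<SolrInputDocument> docs = batches.next();
      client.add(docs); // comment out this one line to time DB reads alone
    }
    System.out.println("elapsed ms: " + (System.currentTimeMillis() - start));
  }
}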

Once your cluster is set up, you might then be able to fire up several
indexing clients. This assumes that you can partition the data you
pull from your database. Say you are indexing 10 years of data: fire
up 10 clients, each of which indexes only one year's worth.
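
Here's a sketch of such a per-year client; the JDBC URL, credentials,
table and column names are invented for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class YearIndexer {
  public static void main(String[] args) throws Exception {
    int year = Integer.parseInt(args[0]); // run 10 copies, one per year
    // JDBC URL, credentials, table and column names are placeholders.
    try (Connection conn = DriverManager.getConnection(
             "jdbc:oracle:thin:@dbhost:1521:ORCL", "user", "pass");
         PreparedStatement ps = conn.prepareStatement(
             "SELECT id, title FROM mytable"
                 + " WHERE EXTRACT(YEAR FROM created) = ?");
         CloudSolrClient solr =
             new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181")) {
      solr.setDefaultCollection("mycollection");
      ps.setInt(1, year);
      List<SolrInputDocument> docs = new ArrayList<>();
      try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.addField("id", rs.getString("id"));
          doc.addField("title_t", rs.getString("title"));
          docs.add(doc);
          if (docs.size() == 1000) { solr.add(docs); docs.clear(); }
        }
      }
      if (!docs.isEmpty()) solr.add(docs);
      solr.commit();
    }
  }
}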

Hope that helps,
Erick


On Tue, Mar 21, 2017 at 7:08 AM, Q&Q <793555...@qq.com> wrote:
> Dear Sir/Madam, I am Li Wei, from China, and I'm writing to you for
> your help. Here is the problem I encountered:
>
>
> Our project has a nightly scheduled task that uses Lucene to build
> an index over the data from an Oracle database. It worked fine at
> the beginning; however, as the index file grows bigger, indexing
> gets slower, and when there is a lot of data to index, the task
> can't finish overnight.
>
>
> To solve this problem, we took the following measure:
> We store the index data in different directories according to when
> the data was inserted into the database. This solves the indexing
> problem to some degree. However, when searching the index, the user
> has to specify the year the data was created in order to search the
> corresponding directory, which is a bad experience for the users.
>
>
> Then we learned that Solr is good at indexing data from a database,
> so we decided to adopt Solr in our project. But as the index data
> gets bigger, it also takes more and more time for Solr to finish the
> indexing task. I know that SolrCloud can solve the search problem
> when the index data is big, but it's even slower in indexing than
> Solr.
>
>
> So I am writing to you for help. Is there any solution for Solr to
> handle this kind of problem? There are more than six hundred million
> records in the database right now, and data is added to the database
> every day. Is it true that if we don't set the UniqueKey property in
> the config.xml file, the problem will be avoided? If so, there is
> another problem: without the UniqueKey property, the index data can
> only be added, but not updated. Could you please give me some
> solutions for these problems?
>
>
> I look forward to your reply. Thank you very much for your time!
>
>
> Best regards,
> Li Wei
