Re: Solr in non-persistent mode

2014-01-25 Thread Per Steffensen
Well, we were using it in our automatic tests to make them run faster -
so that is at least one use case. But after upgrading to 4.4 and the new
solr.xml style, we are no longer running our test suite with Solr instances
in non-persistent mode (we can't). But actually the test suite seems to
complete in almost the same time as before, so it is not a big issue for us.


Regards, Per Steffensen

On 1/23/14 6:09 PM, Mark Miller wrote:

Yeah, I think we removed support in the new solr.xml format. It should still 
work with the old format.

If you have a good use case for it, I don’t know that we couldn’t add it back 
with the new format.
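
For reference, the legacy (pre-4.4) solr.xml format that still honors the flag
looks roughly like this (a minimal sketch; the core name is purely illustrative):

<?xml version="1.0" encoding="UTF-8" ?>
<!-- legacy-format solr.xml; persistent="false" tells Solr not to write
     core changes back to this file (handy for throwaway test setups) -->
<solr persistent="false">
  <cores adminPath="/admin/cores" defaultCoreName="collection1">
    <core name="collection1" instanceDir="collection1" />
  </cores>
</solr>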

- Mark



On Jan 23, 2014, at 3:26 AM, Per Steffensen wrote:

Hi

In Solr 4.0.0 I used to be able to run with persistent=false (in
solr.xml). I can see
(https://cwiki.apache.org/confluence/display/solr/Format+of+solr.xml)
that persistent is no longer supported in solr.xml. Does this mean that
you cannot run in non-persistent mode any longer, or does it mean that I
have to configure it somewhere else?

Thanks!

Regards, Per Steffensen





Re: Solr server requirements for 100+ million documents

2014-01-25 Thread svante karlsson
We are using a postgres server on a different host (same hardware as the
test solr server). The reason we take the data from the postgres server is
that it is easy to automate testing, since we use the same server to produce
queries. In production we preload Solr from a csv file produced by a hive
(hadoop) job and then only write updates ( < 500 / sec ). In our use case we
use Solr as a NoSQL database since we really want to do SHOULD queries against
all the fields. The fields are typically very small text fields (<30 chars),
occasionally bigger, but I don't think I have more than 128 chars on
anything in the whole dataset.

[schema.xml excerpt garbled in the archive: it defined a string fieldType with
omitNorms="true", a type with sortMissingLast="true", a few types with
positionIncrementGap="0", a required single-valued id field plus mostly
single-valued string fields and several multiValued ones, and
<uniqueKey>id</uniqueKey>.]
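
Judging from details elsewhere in this thread (22 fields, 4 of them multivalued,
almost everything a string type, uniqueKey id), the mangled excerpt probably
looked something like the sketch below; the field names are invented purely for
illustration:

<schema name="example" version="1.5">
  <types>
    <!-- unanalyzed, exact-match type used for most fields -->
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>
  </types>
  <fields>
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="field_01" type="string" indexed="true" stored="true" multiValued="false"/>
    <!-- ... more single-valued string fields ... -->
    <field name="aliases" type="string" indexed="true" stored="true" multiValued="true"/>
    <!-- ... a few more multiValued fields ... -->
    <field name="_version_" type="long" indexed="true" stored="true" required="false"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</schema>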







2014/1/25 Kranti Parisa 

> can you post the complete solrconfig.xml file and schema.xml files to
> review all of your settings that would impact your indexing performance.
>
> Thanks,
> Kranti K. Parisa
> http://www.linkedin.com/in/krantiparisa
>
>
>
> On Sat, Jan 25, 2014 at 12:56 AM, Susheel Kumar <
> susheel.ku...@thedigitalgroup.net> wrote:
>
> > Thanks, Svante. Your indexing speed using the db seems really fast. Can you
> > please provide some more detail on how you are indexing db records. Is it
> > thru DataImportHandler? And what database? Is it a local db? We are
> > indexing around 70 fields (60 multivalued), but data is not always populated
> > in all fields. The average size of a document is 5-10 KB.
> >
> > -Original Message-
> > From: saka.csi...@gmail.com [mailto:saka.csi...@gmail.com] On Behalf Of
> > svante karlsson
> > Sent: Friday, January 24, 2014 5:05 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Solr server requirements for 100+ million documents
> >
> > I just indexed 100 million db docs (records) with 22 fields (4
> > multivalued) in 9524 sec using libcurl.
> > 11 million took 763 seconds so the speed drops somewhat with increasing
> > dbsize.
> >
> > We write 1000 docs (just an arbitrary number) in each request from two
> > threads. If you will be using solrcloud you will want more writer
> threads.
> >
> > The hardware is a single cheap HP DL320E GEN8 V2 1P E3-1220V3 with one SSD
> > and 32 GB RAM, and Solr runs on Ubuntu 13.10 inside an ESXi virtual machine.
> >
> > /svante
> >
> >
> >
> >
> > 2014/1/24 Susheel Kumar 
> >
> > > Thanks, Erick for the info.
> > >
> > > For indexing I agree that most of the time is consumed in data acquisition,
> > > which in our case is from the database. Currently we are using the manual
> > > process, i.e. the Solr dashboard Data Import, but are now looking to
> > > automate. How do you suggest we automate the indexing part? Do you
> > > recommend SolrJ, or should we try to automate using curl?
> > >
> > >
> > > -Original Message-
> > > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > > Sent: Friday, January 24, 2014 2:59 PM
> > > To: solr-user@lucene.apache.org
> > > Subject: Re: Solr server requirements for 100+ million documents
> > >
> > > Can't be done with the information you provided, and can only be
> > > guessed at even with more comprehensive information.
> > >
> > > Here's why:
> > >
> > >
> > > http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> > >
> > > Also, at a guess, your indexing speed is so slow due to data
> > > acquisition; I rather doubt you're being limited by raw Solr indexing.
> > > If you're using SolrJ, try commenting out the
> > > server.add() bit and running again. My guess is that your indexing
> > > speed will be almost unchanged, in which case it's the data
> > > acquisition process where you should concentrate your efforts. As a
> > > comparison, I can index 11M Wikipedia docs on my laptop in 45 minutes
> > > without any attempts at parallelization.
> > >
> > >
> > > Best,
> > > Erick
> > >
> > > On Fri, Jan 24, 2014 at 12:10 PM, Susheel Kumar <
> > > susheel.ku...@thedigitalgroup.net> wrote:
> > > > Hi,
> > > >
> > > > Currently we are indexing 10 million documents from a database (10 db data
> > > > entities) & the index size is around 8 GB on a Windows virtual box. Indexing
> > > > in one shot takes 12+ hours, while indexing in parallel in separate cores
> > > > & merging them together takes 4+ hours.
> > > >
> > > > We are looking to scale to 100+ million documents and looking for
> > > > recommendations on server requirements for a Production environment,
> > > > on the parameters below. There can be 200+ users performing searches
> > > > at the same time.
> > > >
> > > > - Number of physical servers (considering SolrCloud)
> > > > - Memory requirement
> > > > - Processor requirement (# of cores)
> > > > - Linux as OS as opposed to Windows
> > > >
> > > > Thanks in advance.
> > > > Susheel
> > > >
> > >
> >
>


Re: Solr server requirements for 100+ million documents

2014-01-25 Thread svante karlsson
That got away a little early...

The inserter is a small C++ program that uses pglib to speak to postgres
and an http-client library that uses libcurl under the hood. The
inserter draws very little CPU, and we normally use 2 writer threads that
each post 1000 records at a time. It's very inefficient to post one at a
time, but I've not done any specific testing to know whether 1000 is better
than 500.

What we're doing now is trying to figure out how to get the query
performance up, since it's not where we need it to be, so we're not done
either...


2014/1/25 svante karlsson 

> We are using a postgres server on a different host (same hardware as the
> test solr server). The reason we take the data from the postgres server is
> that is easy to automate testing since we use the same server to produce
> queries. In production we preload the solr from a csv file from a hive
> (hadoop) job and then only write updates ( < 500 / sec ). In our usecase we
> use solr as NoSQL dabase since we really want to do SHOULD queries against
> all the fields. The fields are typically very small text fields (<30 chars)
> but occasionally bigger but I don't think I have more than 128 chars on
> anything in the whole dataset.
>
> [schema.xml excerpt garbled in the archive; see the first copy of this message above]
>
>
>
>
>
> 2014/1/25 Kranti Parisa 
>
>> can you post the complete solrconfig.xml file and schema.xml files to
>> review all of your settings that would impact your indexing performance.
>>
>> Thanks,
>> Kranti K. Parisa
>> http://www.linkedin.com/in/krantiparisa
>>
>>
>>
>> On Sat, Jan 25, 2014 at 12:56 AM, Susheel Kumar <
>> susheel.ku...@thedigitalgroup.net> wrote:
>>
>> > Thanks, Svante. Your indexing speed using db seems to really fast. Can
>> you
>> > please provide some more detail on how you are indexing db records. Is
>> it
>> > thru DataImportHandler? And what database? Is that local db?  We are
>> > indexing around 70 fields (60 multivalued) but data is not populated
>> always
>> > in all fields. The average size of document is in 5-10 kbs.
>> >
>> > -Original Message-
>> > From: saka.csi...@gmail.com [mailto:saka.csi...@gmail.com] On Behalf Of
>> > svante karlsson
>> > Sent: Friday, January 24, 2014 5:05 PM
>> > To: solr-user@lucene.apache.org
>> > Subject: Re: Solr server requirements for 100+ million documents
>> >
>> > I just indexed 100 million db docs (records) with 22 fields (4
>> > multivalued) in 9524 sec using libcurl.
>> > 11 million took 763 seconds so the speed drops somewhat with increasing
>> > dbsize.
>> >
>> > We write 1000 docs (just an arbitrary number) in each request from two
>> > threads. If you will be using solrcloud you will want more writer
>> threads.
>> >
>> > The hardware is a single cheap hp DL320E GEN8 V2 1P E3-1220V3 with one
>> SSD
>> > and 32GB and the solr runs on ubuntu 13.10 inside a esxi virtual
>> machine.
>> >
>> > /svante
>> >
>> >
>> >
>> >
>> > 2014/1/24 Susheel Kumar 
>> >
>> > > Thanks, Erick for the info.
>> > >
>> > > For indexing I agree the more time is consumed in data acquisition
>> > > which in our case from Database.  For indexing currently we are using
>> > > the manual process i.e. Solr dashboard Data Import but now looking to
>> > > automate.  How do you suggest to automate the index part. Do you
>> > > recommend to use SolrJ or should we try to automate using Curl?
>> > >
>> > >
>> > > -Original Message-
>> > > From: Erick Erickson [mailto:erickerick...@gmail.com]
>> > > Sent: Friday, January 24, 2014 2:59 PM
>> > > To: solr-user@lucene.apache.org
>> > > Subject: Re: Solr server requirements for 100+ million documents
>> > >
>> > > Can't be done with the information you provided, and can only be
>> > > guessed at even with more comprehensive information.
>> > >
>> > > Here's why:
>> > >
>> > >
>> > >
>> http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we
>> > > -dont-have-a-definitive-answer/
>> > >
>> > > Also, at a guess, your indexing speed is so slow due to data
>> > > acquisition; I rather doubt you're being limited by raw Solr indexing.
>> > > If you're using SolrJ, try commenting out the
>> > > server.add() bit and running again. My guess is that your indexing
>> > > speed will be almost unchanged, in which case it's the data
>> > > acquisition process is where you should concentrate efforts. As a
>> > > comparison, I can index 11M Wikipedia docs on my laptop in 45 minutes
>> > > without any attempts at parallelization.
>> > >
>> > >
>> > > Best,
>> > > Erick
>> > >
>> > > On Fri, Jan 24, 2014 at 12:10 PM, Susheel Kumar <
>> > > 

Re: Solr server requirements for 100+ million documents

2014-01-25 Thread Erick Erickson
Hmmm, I'm always suspicious when I see a schema.xml with a lot of "string"
types. This is tangential to your question, but I thought I'd butt in anyway.

String types are totally unanalyzed. So if the input for a field is "I
like Strings",
the only match will be "I like Strings". "I like strings" won't match
due to the
lower-case 's' in strings. "like" won't match since it isn't the complete input.

You may already know this, but I thought I'd point it out. For tokenized
searches, text_general is a good place to start. Pardon me if this is repeating
what you already know...

Lots of string types sometimes lead people with DB backgrounds to reach for
*like*-style wildcard searches, which will be slow, FWIW.
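
For comparison, the stock text_general type from the Solr example schema
(reproduced from memory here, so details may vary slightly between 4.x releases)
tokenizes and lower-cases, which is what makes "I like strings" match:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- split into tokens, drop stopwords, fold case -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>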

Best,
Erick

On Sat, Jan 25, 2014 at 5:51 AM, svante karlsson  wrote:
> That got away a little early...
>
> The inserter is a small C++ program that uses pglib to speek to postgres
> and the a http-client library that uses libcurl under the hood. The
> inserter draws very little CPU and we normally use 2 writer threads that
> each posts 1000 records at a time. Its very inefficient to post one at a
> time but I've not done any specific testing to know if 1000 is better that
> 500
>
> What we're doing now is trying to figure out how to get the query
> performance up since is not where we need it to be so we're not done
> either...
>
>
> 2014/1/25 svante karlsson 
>
>> We are using a postgres server on a different host (same hardware as the
>> test solr server). The reason we take the data from the postgres server is
>> that is easy to automate testing since we use the same server to produce
>> queries. In production we preload the solr from a csv file from a hive
>> (hadoop) job and then only write updates ( < 500 / sec ). In our usecase we
>> use solr as NoSQL dabase since we really want to do SHOULD queries against
>> all the fields. The fields are typically very small text fields (<30 chars)
>> but occasionally bigger but I don't think I have more than 128 chars on
>> anything in the whole dataset.
>>
>> [schema.xml excerpt garbled in the archive; see the first copy of this message above]
>>
>>
>>
>>
>>
>> 2014/1/25 Kranti Parisa 
>>
>>> can you post the complete solrconfig.xml file and schema.xml files to
>>> review all of your settings that would impact your indexing performance.
>>>
>>> Thanks,
>>> Kranti K. Parisa
>>> http://www.linkedin.com/in/krantiparisa
>>>
>>>
>>>
>>> On Sat, Jan 25, 2014 at 12:56 AM, Susheel Kumar <
>>> susheel.ku...@thedigitalgroup.net> wrote:
>>>
>>> > Thanks, Svante. Your indexing speed using db seems to really fast. Can
>>> you
>>> > please provide some more detail on how you are indexing db records. Is
>>> it
>>> > thru DataImportHandler? And what database? Is that local db?  We are
>>> > indexing around 70 fields (60 multivalued) but data is not populated
>>> always
>>> > in all fields. The average size of document is in 5-10 kbs.
>>> >
>>> > -Original Message-
>>> > From: saka.csi...@gmail.com [mailto:saka.csi...@gmail.com] On Behalf Of
>>> > svante karlsson
>>> > Sent: Friday, January 24, 2014 5:05 PM
>>> > To: solr-user@lucene.apache.org
>>> > Subject: Re: Solr server requirements for 100+ million documents
>>> >
>>> > I just indexed 100 million db docs (records) with 22 fields (4
>>> > multivalued) in 9524 sec using libcurl.
>>> > 11 million took 763 seconds so the speed drops somewhat with increasing
>>> > dbsize.
>>> >
>>> > We write 1000 docs (just an arbitrary number) in each request from two
>>> > threads. If you will be using solrcloud you will want more writer
>>> threads.
>>> >
>>> > The hardware is a single cheap hp DL320E GEN8 V2 1P E3-1220V3 with one
>>> SSD
>>> > and 32GB and the solr runs on ubuntu 13.10 inside a esxi virtual
>>> machine.
>>> >
>>> > /svante
>>> >
>>> >
>>> >
>>> >
>>> > 2014/1/24 Susheel Kumar 
>>> >
>>> > > Thanks, Erick for the info.
>>> > >
>>> > > For indexing I agree the more time is consumed in data acquisition
>>> > > which in our case from Database.  For indexing currently we are using
>>> > > the manual process i.e. Solr dashboard Data Import but now looking to
>>> > > automate.  How do you suggest to automate the index part. Do you
>>> > > recommend to use SolrJ or should we try to automate using Curl?
>>> > >
>>> > >
>>> > > -Original Message-
>>> > > From: Erick Erickson [mailto:erickerick...@gmail.com]
>>> > > Sent: Friday, January 24, 2014 2:59 PM
>>> > > To: solr-user@lucene.apache.org
>>> > > Subject: Re: Solr server requirements for 100+ million documents
>>> > >
>>> > > Can't be done with th

Re: Solr server requirements for 100+ million documents

2014-01-25 Thread svante karlsson
You are of course right, but we do our own normalization (among other things
"to_lower") before we insert and before search queries are submitted.

We do not use wildcards in searches either, so in our problem domain it
works quite well.
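
If you ever want Solr itself to do the lower-casing while keeping whole-value,
string-like matching, one option is a keyword-tokenized type along these lines
(the type name is just illustrative):

<!-- matches the whole field value like "string", but case-insensitively -->
<fieldType name="string_lc" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>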

/svante




2014/1/25 Erick Erickson 

> Hmmm, I'm always suspicious when I see a schema.xml with a lot of "string"
> types. This is tangential to your question, but I thought I'd butt in
> anyway.
>
> String types are totally unanalyzed. So if the input for a field is "I
> like Strings",
> the only match will be "I like Strings". "I like strings" won't match
> due to the
> lower-case 's' in strings. "like" won't match since it isn't the complete
> input.
>
> You may already know this, but thought I'd point it out. For tokenized
> searches, text_general is a good place to start. Pardon me if this is
> repeating
> what you already know
>
> Lots of string types sometimes lead people with DB backgrounds to
> search for *like* which will be slow FWIW.
>
> Best,
> Erick
>
> On Sat, Jan 25, 2014 at 5:51 AM, svante karlsson  wrote:
> > That got away a little early...
> >
> > The inserter is a small C++ program that uses pglib to speek to postgres
> > and the a http-client library that uses libcurl under the hood. The
> > inserter draws very little CPU and we normally use 2 writer threads that
> > each posts 1000 records at a time. Its very inefficient to post one at a
> > time but I've not done any specific testing to know if 1000 is better
> that
> > 500
> >
> > What we're doing now is trying to figure out how to get the query
> > performance up since is not where we need it to be so we're not done
> > either...
> >
> >
> > 2014/1/25 svante karlsson 
> >
> >> We are using a postgres server on a different host (same hardware as the
> >> test solr server). The reason we take the data from the postgres server
> is
> >> that is easy to automate testing since we use the same server to produce
> >> queries. In production we preload the solr from a csv file from a hive
> >> (hadoop) job and then only write updates ( < 500 / sec ). In our
> usecase we
> >> use solr as NoSQL dabase since we really want to do SHOULD queries
> against
> >> all the fields. The fields are typically very small text fields (<30
> chars)
> >> but occasionally bigger but I don't think I have more than 128 chars on
> >> anything in the whole dataset.
> >>
> >> [schema.xml excerpt garbled in the archive; see the first copy of this message above]
> >>
> >>
> >>
> >>
> >>
> >> 2014/1/25 Kranti Parisa 
> >>
> >>> can you post the complete solrconfig.xml file and schema.xml files to
> >>> review all of your settings that would impact your indexing
> performance.
> >>>
> >>> Thanks,
> >>> Kranti K. Parisa
> >>> http://www.linkedin.com/in/krantiparisa
> >>>
> >>>
> >>>
> >>> On Sat, Jan 25, 2014 at 12:56 AM, Susheel Kumar <
> >>> susheel.ku...@thedigitalgroup.net> wrote:
> >>>
> >>> > Thanks, Svante. Your indexing speed using db seems to really fast.
> Can
> >>> you
> >>> > please provide some more detail on how you are indexing db records.
> Is
> >>> it
> >>> > thru DataImportHandler? And what database? Is that local db?  We are
> >>> > indexing around 70 fields (60 multivalued) but data is not populated
> >>> always
> >>> > in all fields. The average size of document is in 5-10 kbs.
> >>> >
> >>> > -Original Message-
> >>> > From: saka.csi...@gmail.com [mailto:saka.csi...@gmail.com] On
> Behalf Of
> >>> > svante karlsson
> >>> > Sent: Friday, January 24, 2014 5:05 PM
> >>> > To: solr-user@lucene.apache.org
> >>> > Subject: Re: Solr server requirements for 100+ million documents
> >>> >
> >>> > I just indexed 100 million db docs (records) with 22 fields (4
> >>> > multivalued) in 9524 sec using libcurl.
> >>> > 11 million took 763 seconds so the speed drops somewhat with
> increasing
> >>> > dbsize.
> >>> >
> >>> > We write 1000 docs (just an arbitrary number) in each request from
> two
> >>> > threads. If you will be using solrcloud you will want more writer
> >>> threads.
> >>> >
> >>> > The hardware is a single cheap hp DL320E GEN8 V2 1P E3-1220V3 with
> one
> >>> SSD
> >>> > and 32GB and the solr runs on ubuntu 13.10 inside a esxi virtual
> >>> machine.
> >>> >
> >>> > /svante
> >>> >
> >>> >
> >>> >
> >>> >
> >>> > 2014/1/24 Susheel Kumar 
> >>> >
> >>> > > Thanks, Erick for the info.
> >>> > >
> >>> > > For indexing I agree the more time is consumed in data acquisition
> 

Re: Replica not consistent after update request?

2014-01-25 Thread Nathan Neulinger

Ok, so our issue sounds like a combination of not having soft commits properly
configured, combined with SOLR-4260.
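
For anyone else landing here, the relevant solrconfig.xml block looks roughly
like this; the intervals are only examples and should be tuned to your own
latency/throughput needs:

<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit: flushes to stable storage and truncates the transaction log,
       but with openSearcher=false it does not make changes searchable -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
  <!-- soft commit: makes recently indexed documents visible to searches -->
  <autoSoftCommit>
    <maxTime>2000</maxTime>
  </autoSoftCommit>
</updateHandler>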

Thanks everyone!

On 01/24/2014 11:04 PM, Erick Erickson wrote:

Right. The updates are guaranteed to be on the replicas and in their
transaction logs. That doesn't mean they're searchable, however. For a
document to be found in a search there must be a commit, either soft,
or hard with openSearcher=true. Here's a post that outlines all this.



If you have discrepancies even after commits, that's a problem.

Best,
Erick

On Fri, Jan 24, 2014 at 8:52 PM, Nathan Neulinger  wrote:

How can we issue an update request and be certain that all of the replicas
in the SolrCloud cluster are up to date?

I found this post:

 http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/79886

which seems to indicate that all replicas for a shard must finish/succeed
before it returns to the client that the operation succeeded - but we've been
seeing behavior lately (until we configured automatic soft commits) where
the replicas were almost always "not current" - i.e. the replicas were
missing documents/etc.

Is this something wrong with our cloud setup/replication, or am I
misinterpreting the way that updates in a cloud deployment are supposed to
function?

If it's a problem with our cloud setup, do you have any suggestions on
diagnostics?

Alternatively, are we perhaps just using it wrong?

-- Nathan


Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412


--

Nathan Neulinger   nn...@neulinger.org
Neulinger Consulting   (573) 612-1412


How to handle multiple sub second updates to same SOLR Document

2014-01-25 Thread christopher palm
I have a scenario where the same SOLR document is being updated several
times within a few ms of each other due to how the source system is sending
in field updates on the document.

The problem I am trying to solve is that the order of these updates isn't
guaranteed once the multi-threaded SolrJ client starts sending them to
SOLR, and older updates end up overwriting newer updates on the same
document.

I would like to use a timestamp versioning so that the older document
change won’t be sent into SOLR, but I didn’t see any automated way of doing
this based on the document timestamp.

Is there a good way to handle this scenario in SOLR 4.6?

It seems that we would have to be soft auto committing at a sub-second
interval as well; is that even possible?

Thanks,

Chris


RE: Solr server requirements for 100+ million documents

2014-01-25 Thread Susheel Kumar
Hi Kranti,

Attached are the solrconfig.xml & schema.xml for review. I did run indexing with
just a few fields (5-6) in schema.xml, keeping the same db config, but indexing
still takes roughly the same time (on average 1 million records per hour), which
confirms that the bottleneck is data acquisition, which in our case is the Oracle
database. I am thinking of not using DataImportHandler / JDBC to get data from
Oracle, but rather dumping the data from Oracle using SQL loader and then
indexing it. Any thoughts?
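
If you do go the dump-and-load route, note that Solr 4.x can ingest CSV directly
through the update handler; a mapping along these lines (registered implicitly in
recent 4.x, so you only need it explicitly if you want to customize the defaults)
lets you POST the dump file straight to /update/csv:

<!-- accepts CSV bodies; per-request params such as separator, header and
     fieldnames control how the file is parsed -->
<requestHandler name="/update/csv" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="stream.contentType">application/csv</str>
  </lst>
</requestHandler>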

Thnx

-Original Message-
From: Kranti Parisa [mailto:kranti.par...@gmail.com] 
Sent: Saturday, January 25, 2014 12:08 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr server requirements for 100+ million documents

can you post the complete solrconfig.xml file and schema.xml files to review 
all of your settings that would impact your indexing performance.

Thanks,
Kranti K. Parisa
http://www.linkedin.com/in/krantiparisa



On Sat, Jan 25, 2014 at 12:56 AM, Susheel Kumar < 
susheel.ku...@thedigitalgroup.net> wrote:

> Thanks, Svante. Your indexing speed using db seems to really fast. Can 
> you please provide some more detail on how you are indexing db 
> records. Is it thru DataImportHandler? And what database? Is that 
> local db?  We are indexing around 70 fields (60 multivalued) but data 
> is not populated always in all fields. The average size of document is in 
> 5-10 kbs.
>
> -Original Message-
> From: saka.csi...@gmail.com [mailto:saka.csi...@gmail.com] On Behalf 
> Of svante karlsson
> Sent: Friday, January 24, 2014 5:05 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Solr server requirements for 100+ million documents
>
> I just indexed 100 million db docs (records) with 22 fields (4
> multivalued) in 9524 sec using libcurl.
> 11 million took 763 seconds so the speed drops somewhat with 
> increasing dbsize.
>
> We write 1000 docs (just an arbitrary number) in each request from two 
> threads. If you will be using solrcloud you will want more writer threads.
>
> The hardware is a single cheap hp DL320E GEN8 V2 1P E3-1220V3 with one 
> SSD and 32GB and the solr runs on ubuntu 13.10 inside a esxi virtual machine.
>
> /svante
>
>
>
>
> 2014/1/24 Susheel Kumar 
>
> > Thanks, Erick for the info.
> >
> > For indexing I agree the more time is consumed in data acquisition 
> > which in our case from Database.  For indexing currently we are 
> > using the manual process i.e. Solr dashboard Data Import but now 
> > looking to automate.  How do you suggest to automate the index part. 
> > Do you recommend to use SolrJ or should we try to automate using Curl?
> >
> >
> > -Original Message-
> > From: Erick Erickson [mailto:erickerick...@gmail.com]
> > Sent: Friday, January 24, 2014 2:59 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Solr server requirements for 100+ million documents
> >
> > Can't be done with the information you provided, and can only be 
> > guessed at even with more comprehensive information.
> >
> > Here's why:
> >
> >
> > http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
> >
> > Also, at a guess, your indexing speed is so slow due to data 
> > acquisition; I rather doubt you're being limited by raw Solr indexing.
> > If you're using SolrJ, try commenting out the
> > server.add() bit and running again. My guess is that your indexing 
> > speed will be almost unchanged, in which case it's the data 
> > acquisition process is where you should concentrate efforts. As a 
> > comparison, I can index 11M Wikipedia docs on my laptop in 45 
> > minutes without any attempts at parallelization.
> >
> >
> > Best,
> > Erick
> >
> > On Fri, Jan 24, 2014 at 12:10 PM, Susheel Kumar < 
> > susheel.ku...@thedigitalgroup.net> wrote:
> > > Hi,
> > >
> > > Currently we are indexing 10 million document from database (10 db 
> > > data
> > entities) & index size is around 8 GB on windows virtual box. 
> > Indexing in one shot taking 12+ hours while indexing parallel in 
> > separate cores & merging them together taking 4+ hours.
> > >
> > > We are looking to scale to 100+ million documents and looking for
> > recommendation on servers requirements on below parameters for a 
> > Production environment. There can be 200+ users performing search 
> > same
> time.
> > >
> > > No of physical servers (considering solr cloud) Memory requirement 
> > > Processor requirement (# cores) Linux as OS oppose to windows
> > >
> > > Thanks in advance.
> > > Susheel
> > >
> >
>


[Attachments: solrconfig.xml, schema.xml]


Re: How to handle multiple sub second updates to same SOLR Document

2014-01-25 Thread Shalin Shekhar Mangar
There is no timestamp versioning as such in Solr, but there is a new
document-based versioning feature which allows you to specify your own
(externally assigned) versions.

See the "Document Centric Versioning Constraints" section at
https://cwiki.apache.org/confluence/display/solr/Updating+Parts+of+Documents
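
A minimal update chain wired for this might look like the sketch below; the
versionField name is illustrative and has to exist in your schema (typically a
long field), and with ignoreOldUpdates=true any update carrying a version older
than the one already indexed is silently dropped:

<updateRequestProcessorChain name="external-version">
  <processor class="solr.DocBasedVersionConstraintsProcessorFactory">
    <!-- field in each incoming document that carries your externally assigned
         version (e.g. a source-system timestamp encoded as a long) -->
    <str name="versionField">my_version_l</str>
    <!-- drop updates whose version is older than the indexed one instead of failing -->
    <bool name="ignoreOldUpdates">true</bool>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>

The chain then has to be selected with update.chain on the update request, or
set as the default chain on the /update handler.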

Sub-second soft auto commit can be expensive but it is hard to say if
it will be too expensive for your use-case. You must benchmark it
yourself.

On Sat, Jan 25, 2014 at 11:51 PM, christopher palm  wrote:
> I have a scenario where the same SOLR document is being updated several
> times within a few ms of each other due to how the source system is sending
> in field updates on the document.
>
> The problem I am trying to solve is that the order of these updates isn’t
> guaranteed once the multi threaded SOLRJ client starts sending them to
> SOLR, and older updates are overlaying the newer updates on the same
> document.
>
> I would like to use a timestamp versioning so that the older document
> change won’t be sent into SOLR, but I didn’t see any automated way of doing
> this based on the document timestamp.
>
> Is there a good way to handle this scenario in SOLR 4.6?
>
> It seems that we would have to be soft auto committing with a  subsecond
> level as well, is that even possible?
>
> Thanks,
>
> Chris



-- 
Regards,
Shalin Shekhar Mangar.