Re: Solr or SQL fulltext search

2011-12-07 Thread Hector Castro
This article shouldn't flat out make the decision for you, but the concerns 
the Stack Overflow team raised about full-text search in SQL Server 2008 
helped guide us toward Solr:

http://www.infoq.com/news/2008/11/SQL-Server-Text

--
Hector

On Dec 7, 2011, at 2:17 AM, Mersad wrote:

> Hi everyone,
> 
> I am wondering how much benefit I would get from moving from SQL Server to 
> Solr in my text-based search project.
> Any help is appreciated!
> 
> 
> best
> Mersad



Re: Looking for a good Text on Solr

2011-12-16 Thread Hector Castro
Hi Shiv, 

A combination of the following has helped me learn a lot about Solr in a 
short period of time:

* Apache Solr 3 Enterprise Search Server: 
http://www.packtpub.com/apache-solr-3-enterprise-search-server/book
* Solr Wiki: http://wiki.apache.org/solr/
* Pretty much every single post on this blog: 
http://www.hathitrust.org/blogs/large-scale-search

Hope this helps,

-- 
Hector


On Friday, December 16, 2011 at 9:01 PM, Shiv Deepak wrote:

> I am looking for a good book to read to get a better understanding of 
> Solr.
> 
> On Amazon, all the books on Solr have average ratings (which I suppose means 
> no one has tried them or bothered to post a review), except this one: "Solr 
> 1.4 Enterprise Search Server" by David Smiley and Eric Pugh, which has pretty 
> decent reviews. But the current version of Solr is 3.5, so should I proceed 
> with David Smiley's book, or is there a better text available?
> 
> Thanks,
> Shiv Deepak
> 
> 




Solr core as a dispatcher

2012-01-09 Thread Hector Castro
Hi,

Has anyone had success with multicore single node Solr configurations that have 
one core acting solely as a dispatcher for the other cores?  For example, say 
you had 4 populated Solr cores – configure a 5th to be the definitive endpoint 
with `shards` containing cores 1-4.  

Is there any advantage to this setup over simply having requests distributed 
randomly across the 4 populated cores (all with `shards` equal to cores 1-4)?  
Is it even worth distributing requests across the cores over always hitting the 
same one?
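
To make the two options concrete, here is a rough Python sketch of the request URLs involved. The host, port, core names (`core1` through `core5`) and the query are placeholders, not values from a real deployment. It passes `shards` on the request in the direct case, although the same value can instead be set as a default in a core's request handler configuration.

```python
from urllib.parse import urlencode

# Placeholder values, not taken from a real deployment.
HOST = "localhost:8983"
DATA_CORES = ["core1", "core2", "core3", "core4"]
DISPATCHER_CORE = "core5"


def shards_value(host, cores):
    """Build the comma-separated `shards` value (host:port/path, no scheme)."""
    return ",".join(f"{host}/solr/{core}" for core in cores)


def select_url(host, core, query, shards=None):
    """Build a /select URL, adding `shards` only when the client supplies it."""
    params = {"q": query}
    if shards is not None:
        params["shards"] = shards
    return f"http://{host}/solr/{core}/select?{urlencode(params)}"


# Dispatcher approach: the client only knows one endpoint; the shard list is
# assumed to be configured on the dispatcher core itself.
dispatcher_url = select_url(HOST, DISPATCHER_CORE, "title:solr")

# Direct approach: the client picks a populated core and carries the shard
# list with every request.
direct_url = select_url(HOST, "core1", "title:solr", shards_value(HOST, DATA_CORES))

print(dispatcher_url)
print(direct_url)
```

In the first case the client never needs to know how many cores sit behind the endpoint; in the second, every client has to be kept in sync with the shard list.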

Thanks,  

--  
Hector



Re: Solr core as a dispatcher

2012-01-10 Thread Hector Castro
In my case the cores are populated with different records that adhere to the 
same schema. I asked about randomly distributing requests because each core 
has the `shards` parameter populated, so any of them can search the other 
cores' indexes.

My question is more about the advantages (if any) of utilizing a dispatcher 
core vs. simply querying the populated cores. 

--
Hector

On Jan 10, 2012, at 1:57 AM, shlomi java wrote:

> If you want to randomly distribute requests across shards, then I think
> it's a case for Replication.
> 
> In a Replication setup, all cores have the same schema AND data, so querying
> any core should return the same result. It is used to support heavy load. Of
> course, such a setup will require some kind of load balancer.
> 
> In Distributed Search, the shards have the same schema, but NOT the same
> data. So there is no point in randomly querying a shard, because we will
> get randomly different results.
> 
> ShlomiJ
> 
> On Tue, Jan 10, 2012 at 2:15 AM, Hector Castro wrote:
> 
>> Hi,
>> 
>> Has anyone had success with multicore single node Solr configurations that
>> have one core acting solely as a dispatcher for the other cores?  For
>> example, say you had 4 populated Solr cores – configure a 5th to be the
>> definitive endpoint with `shards` containing cores 1-4.
>> 
>> Is there any advantage to this setup over simply having requests
>> distributed randomly across the 4 populated cores (all with `shards` equal
>> to cores 1-4)?  Is it even worth distributing requests across the cores
>> over always hitting the same one?
>> 
>> Thanks,
>> 
>> --
>> Hector
>> 
>> 


Re: Solr core as a dispatcher

2012-01-11 Thread Hector Castro
In our setup, we handle the document distribution and uniqueness across cores 
outside of Solr.
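
For anyone wondering what routing outside of Solr can look like, here is a minimal Python sketch of one common approach: hash the unique key and take it modulo the number of cores. This is an illustration only, not a description of our actual code; the core names and the choice of MD5 are assumptions.

```python
import hashlib

# Hypothetical core names; the real list would come from your deployment.
CORES = ["core1", "core2", "core3", "core4"]


def core_for(doc_id, cores=CORES):
    """Pick a core deterministically from the document's unique key."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    # The same id always hashes to the same core, so a re-indexed document
    # overwrites its earlier copy instead of landing on a second core.
    return cores[int(digest, 16) % len(cores)]


print(core_for("doc-42"))
```

One caveat with plain modulo hashing: changing the number of cores remaps most ids, so it usually means a full re-index.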

--
Hector

On Jan 11, 2012, at 1:53 AM, shlomi java wrote:

> Straying a bit from the subject,
> 
> don't you think it would be useful to have the shards parameter also used at
> index time, in order to maintain document uniqueness?
> I mean as an out-of-the-box feature of Solr.
> 
> Because the situation today is that a client working with a sharded Solr
> is responsible for keeping documents unique across all shards.
> 
> *Solution* - let Solr decide in which shard to index a document, using a
> pluggable hashing method.
> 
> What do you think?
> 
> ShlomiJ
> 
> On Tue, Jan 10, 2012 at 6:15 PM, Shawn Heisey wrote:
> 
>> On 1/9/2012 5:15 PM, Hector Castro wrote:
>> 
>>> Hi,
>>> 
>>> Has anyone had success with multicore single node Solr configurations
>>> that have one core acting solely as a dispatcher for the other cores?  For
>>> example, say you had 4 populated Solr cores – configure a 5th to be the
>>> definitive endpoint with `shards` containing cores 1-4.
>>> 
>>> Is there any advantage to this setup over simply having requests
>>> distributed randomly across the 4 populated cores (all with `shards` equal
>>> to cores 1-4)?  Is it even worth distributing requests across the cores
>>> over always hitting the same one?
>>> 
>> 
>> I've got a setup where a single index chain consists of seven cores across
>> two servers.  Those seven cores do not have the shards parameter in them.
>> I have what you call a dispatcher core (I call it a broker core) that
>> contains the shards parameter, but has no index data.
>> 
>> If you use a dispatcher core, your application does not need to be
>> concerned with the makeup of your index, so you don't need to include a
>> shards parameter with your request.  You can change the index distribution
>> and not have to worry about your application configuration.  This is the
>> major advantage to doing it this way.  Distributed search has some overhead
>> and not all Solr features work with it, so if your application already
>> knows which core will contain the data it is trying to find, it is better
>> to query the right core directly.
>> 
>> Be careful that you aren't adding a shards parameter to a core
>> configuration that points at itself.  Solr will, as of the last time I
>> checked, try to complete the recursion and will fail.
>> 
>> Thanks,
>> Shawn
>> 
>> 



Re: Solr core as a dispatcher

2012-01-11 Thread Hector Castro
Thanks for the reply, Ken – it was your training session that brought the 
dispatcher core approach to my attention in the first place.  

Regarding your point about deep queries: if numFound=5000 and you need to 
output all 5000 records at once, it sounds like you're better off requesting 
rows=5000 in a single query than paging through in chunks of 100.  
Is that correct?   
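
Here is the back-of-the-envelope arithmetic behind that question, as a small Python sketch. It assumes each shard returns its top (start + rows) id/score entries to the dispatcher for every request, and that every shard has at least that many matches. The shard count of 4 is just an example, and the second-phase fetch of the stored fields is ignored.

```python
def entries_per_request(num_shards, start, rows):
    """Upper bound on id/score entries sent to the dispatcher for one request.

    Each shard has to return its own top (start + rows) candidates, because
    any of them could end up in the merged page.
    """
    return num_shards * (start + rows)


def single_request_cost(num_shards, total_rows):
    """Entries collected when everything is fetched with one large request."""
    return entries_per_request(num_shards, 0, total_rows)


def chunked_cost(num_shards, total_rows, page_size):
    """Entries collected when paging through the same results in chunks."""
    total = 0
    for start in range(0, total_rows, page_size):
        rows = min(page_size, total_rows - start)
        total += entries_per_request(num_shards, start, rows)
    return total


print(single_request_cost(4, 5000))  # 20000
print(chunked_cost(4, 5000, 100))    # 510000
```

By this count, chunking by 100 collects roughly 25 times as many entries overall, though the single large request concentrates the memory cost in one response.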

--  
Hector


On Wednesday, January 11, 2012 at 7:10 PM, Ken Krugler wrote:

> Hi Hector,
>  
> On Jan 9, 2012, at 4:15pm, Hector Castro wrote:
>  
> > Hi,
> >  
> > Has anyone had success with multicore single node Solr configurations that 
> > have one core acting solely as a dispatcher for the other cores? For 
> > example, say you had 4 populated Solr cores – configure a 5th to be the 
> > definitive endpoint with `shards` containing cores 1-4.  
> >  
> > Is there any advantage to this setup over simply having requests 
> > distributed randomly across the 4 populated cores (all with `shards` equal 
> > to cores 1-4)? Is it even worth distributing requests across the cores over 
> > always hitting the same one?
>  
> If you have low query rates, then using a shards approach can improve 
> performance on a multi-core (CPUs here, not Solr cores) setup.
>  
> By distributing the requests, you effectively use all CPU cores in parallel 
> on one request.
>  
> And if you spread your shards across spindles, then you're also maximizing 
> I/O throughput.
>  
> But there are a few issues with this approach:
>  
> - binary fields don't work. The results come back as "@B[]", 
> versus the actual data.
> - short fields get "java.lang.Short" text prefixed on every value.
> - deep queries result in lots of extra load. E.g. if you want the 5000th hit 
> then you'll get (5000 * # of shards) hits being collected/returned to the 
> dispatcher. Though only the unique id & score is returned in this case, 
> followed by the second request to get the actual top N hits from the shards.
>  
> And there's something wonky with the way that distributed HTTP requests are 
> queued up & processed - under load, I see IOExceptions where it's always N-1 
> shards that succeed, and one shard request fails. But I don't have a good 
> reproducible case yet to debug.
>  
> -- Ken
>  
> --
> Ken Krugler
> http://www.scaleunlimited.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr
>  
>  




Re: How to get the time document was indexed?

2012-01-20 Thread Hector Castro
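As Tommaso said, adding a date field with default="NOW" to the schema.xml gives 
you an automatic timestamp set at index time.  The default schema.xml with Solr 
3.5.0 has a commented example along these lines:

   <!--
   <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
   -->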
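
Once such a field is populated, deciding whether to re-index comes down to a date comparison on the client side. Here is a minimal Python sketch, assuming the stored value comes back as a UTC string such as `2012-01-20T08:15:00Z` and that you know when the source record was last modified. The function names are made up for illustration.

```python
from datetime import datetime, timezone


def parse_solr_date(value):
    """Parse a Solr UTC date string, with or without fractional seconds."""
    for fmt in ("%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%dT%H:%M:%S.%fZ"):
        try:
            return datetime.strptime(value, fmt).replace(tzinfo=timezone.utc)
        except ValueError:
            continue
    raise ValueError(f"unrecognised Solr date: {value!r}")


def needs_reindex(indexed_at, source_modified_at):
    """True when the source changed after the document was indexed."""
    return source_modified_at > parse_solr_date(indexed_at)


modified = datetime(2012, 1, 20, 9, 0, tzinfo=timezone.utc)
print(needs_reindex("2012-01-20T08:15:00Z", modified))
```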



--
Hector

On Jan 20, 2012, at 8:15 AM, Tommaso Teofili wrote:

> Hi Alex,
> you can create a field in the schema.xml of type date or tdate called
> (something like) idx_timestamp and set its default option to NOW then you
> won't have to add any extra fields to the documents because it will be
> automatically created when documents are indexed.
> Hope it helps.
> Tommaso
> 
> 2012/1/20 ola nowak 
> 
>> Hi,
>> I want to be able to tell when the document was indexed, so I could
>> re-index it if it has changed in the meantime. Is there an easy way to do
>> this? Or do I have to manually put the date in the document and add a new
>> field in the schema?
>> Thanks,
>> Alex
>>