On Apr 20, 2007, at 2:30 PM, solruser wrote:
For pure Ruby access to Solr without a database, use solr-ruby. The
0.01 gem is available via "gem install solr-ruby", but if you can,
I'd recommend you tinker with the trunk codebase too.
Well, say I'm considering using Solr with a Rails application. What's
the ideal approach?
"rails application" is a pretty broad category of applications at
this point. If we're talking about a database-backed application
being searchable by Solr, I'd go for the RubyForge acts_as_solr
first. However, I suspect that it needs work in terms of
facilitating access to facets, highlighting, and other types of
custom query handlers.
If your application is backed by other datastores, as in my case: a
bunch of MARC records in binary format, a flat delimited file, a
ZIP file full of RDF/XML files, or, even more interestingly, another
Solr instance that we wanted to repurpose in another Solr-based
application, then go with solr-ruby.
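(For anyone who hasn't seen solr-ruby's API, a minimal round trip
looks roughly like the sketch below. The URL and the :title_text
field name are assumptions; field names in particular depend
entirely on your Solr schema.)

require 'rubygems'
require 'solr'

# connect to a running Solr instance; :autocommit saves explicit commit calls
conn = Solr::Connection.new('http://localhost:8983/solr', :autocommit => :on)

# index a document (the :title_text field is a made-up example)
conn.add(:id => 1, :title_text => 'Lucene in Action')

# query and walk the hits
response = conn.query('lucene')
response.hits.each { |hit| puts hit['title_text'] }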
It's my intention to bridge this gap in the near future somehow; I
just haven't formulated an exact plan. acts_as_solr fits nicely and
very easily on top of solr-ruby. I envision acts_as_solr simply
being part of solr-ruby: it'd hook in only if you have ActiveRecord
installed, and otherwise it'd be transparent, taking up just a few
tens of lines of code in an un-required .rb file.
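(To sketch what I mean by hooking in only when ActiveRecord is
present; this is hypothetical, hand-waved code, not the actual
acts_as_solr implementation:)

# only wire in the ActiveRecord integration when ActiveRecord is loaded
if defined?(ActiveRecord::Base)
  class ActiveRecord::Base
    def self.acts_as_solr(options = {})
      # index records after save, delete from the index on destroy,
      # add a find_by_solr class method, etc. (details omitted here)
    end
  end
end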
The first step could be to patch the RubyForge acts_as_solr to use
solr-ruby to kick start collaboration. As for where my effort fits
into a calendar, within the next few weeks I'll be delving into it
deeply and can speak more definitively.
Since there are many flavors floating around, which is the most
sought after and supported? And I agree that a definitive version
will help the RoR community accept Solr with a much greater level of
confidence.
And since RoR applications are addressing Web 2.0, the need to
search and collaborate on information is much higher. So I
personally believe addressing this will definitely go a long way.
That's the plan! No question about it. I personally am running on
all cylinders, and will make progress on these technologies as my
real-world needs require them, and those needs are increasing all
the time. All savvy SolRubyists are invited to jump in!
I've not documented this stuff on the wiki to the standards set by
the Solr engine itself, but there is some pretty amazing power going
on with solr-ruby right now. For example, the data mapping / indexer
framework makes it easy to import a dataset into Solr using Ruby:
source = DataSource.new   # any object with an #each method will do

mapping = {
  :id     => :isbn,
  :name   => :author,
  :source => "BOOKS",
  :year   => Proc.new { |record| record.date[0,4] },
}

Solr::Indexer.index(source, mapping) do |orig_data, solr_document|
  solr_document[:timestamp] = Time.now
end
This showcases the simplistic data source facility (*quack* -
anything that has an #each method) [with a contrived, bogus
DataSource class], and the mapping capabilities. The mapping is a
hash of Solr field names to value mappings. A value mapping can be a
String ("BOOKS"); a Symbol (:isbn, :author), which looks up that
field on each of the objects yielded by the data source's #each
(this lookup simply means, again *quack*, that the data object needs
a [] method defined); or a Proc. The Proc example is a bit more
advanced Ruby voodoo for embedding a bit of code into the mapping to
be executed later with the actual record passed into it; in the
example it takes the first four characters of the record's date
property. And one more bit of Ruby coolness is the do ... end block
for the indexer method. The indexer takes a data source and a
mapping, melding them together as described, and allows you one
final chance to affect the solr_document before it gets indexed; of
course you're also provided the original data object.
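(To make the duck typing concrete, here is a minimal, contrived
DataSource along those lines. The records are made up; a Struct
conveniently gives each record both the [] lookup the Symbol
mappings use and the .date accessor the Proc uses.)

Book = Struct.new(:isbn, :author, :date)

class DataSource
  # *quack* - all a data source needs is #each yielding record objects
  def each
    [Book.new("0321293199", "Smith", "1918-06-01"),
     Book.new("1932394885", "Jones", "2004-12-01")].each do |record|
      yield record
    end
  end
end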
We already have a simple mapper, an XPath mapper, and an Hpricot
mapper available. We also have some handy data sources, including a
tab-delimited file source (obsoleted in my playbook by the CSV
importer now built in). I'm also using a simple custom MARC binary
data source and mapper specific to ruby-marc objects, and I just put
together a SolrSource that takes a query (and filters) for one Solr
instance, pages through the results in a configurable way, and
successively feeds out the documents returned from that query.
Apply a mapper to that data source and you can pipe data from one
Solr to another like this:
solr_source = Solr::Importer::SolrSource.new("http://localhost:8420/solr",
                "*:*", ["year:[1776 TO 1918]", 'author:smith'])

count = 0
# mapper here is a mapping like the one shown earlier
Solr::Indexer.index(solr_source, mapper,
                    {:debug => false, :timeout => 120,
                     :solr_url => "http://localhost:8983/solr"}) do |orig_data, solr_document|
  count += 1
  puts count if count % 100 == 0
end
The count junk is just to see console progress on how many records
have been indexed.
So I'm working the Ruby/Solr thing as much as possible right now.
There is something to what we've got there, but it's not packaged as
nicely as needed for a community to flourish, and for that I
apologize. But there is also enough goodness there now to lure folks
in to want to get involved.
Right now in RoR with the Flare plugin installed, you can have a
controller that looks like this:
class SearchController < ApplicationController
  flare
end
And with some copy/pasting of templates (which I'm sure we can build
in as defaults somehow), you have a faceted browsing, Ajax-tricked-out
(well, in-place editor and Ajax suggest) experience with how many
lines of code? (The devil is in the details, though, and that is why
I don't yet recommend Flare to folks who just want it to work and
also be configurable.) Flare cuts a lot of corners by hard-coding
some things that need to be made configurable, etc. Typical
prototyping approach: tinker, tinker, tinker, distill. I'm still in
the first tinker phase with Flare right now. But folks interested in
rolling up their sleeves who don't mind getting a little grubby with
code are more than invited to delve into Flare now, with the
forewarning that the Flare you see today will not be at all near the
Flare that spawns from the ashes. Pioneering spirit required.
3. Is a performance benchmark available for the acts_as_solr plugin?
What kind of numbers are you after? acts_as_solr searches Solr, and
then will fetch the records from the database to bring back model
objects, so you have to account for the database access in the
picture as well as Solr.
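(Conceptually, the two round trips look something like this; a
hedged sketch, not the plugin's actual code, with made-up model and
field names:)

require 'solr'

conn = Solr::Connection.new('http://localhost:8983/solr')

# phase 1: ask Solr which records match
ids = conn.query('title_t:ruby').hits.map { |hit| hit['pk_i'] }

# phase 2: hydrate the ActiveRecord models from the database
books = Book.find(ids)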
Well, to be specific, I am keen to know about the creation and
update of indexes when you run into a large number of documents.
Since the database is used to populate the models, it will
definitely be the cumulative effect of retrieving documents from
Solr/Lucene, network issues (since it's a web service), and the
local database (depending on configuration).
Again, we need to be clear about "large". I've got nearly 4M
documents indexed under my belt now, but many others have gone to
10M+. Lucene and Solr both scale very well into the tens of millions
and, I've heard, even further up into the hundreds of millions.
Certainly those other latencies you mention are valid questions, but
in my experience they've not been show-stopping concerns; performance
with Solr + Ruby has been more than acceptable... it's been just
fine, even with several spots for improvement in all those areas in
my applications. First rule of optimization: Don't. Second rule of
optimization: Don't optimize yet.
Erik