On Mon, Dec 5, 2011 at 3:28 PM, Shawn Heisey <s...@elyograg.org> wrote:

> On 12/4/2011 12:41 AM, Ted Dunning wrote:
>
>> Read the papers I referred to.  They describe how to search a fairly
>> enormous corpus with an 8GB in-memory index (and no disk cache at all).
>>
>
> They would seem to indicate moving away from Solr.  While that would not
> be entirely out of the question, I don't relish coming up with a whole new
> system from scratch, one part of which would mean rewriting the build
> system for a third time.
>

Yeah.  That wouldn't be good.  But there are lots of interesting
developments happening in new index formats in Lucene.  Flexible indexing
is very nice.

It may not help you immediately, but I think that techniques like this are
going to make a huge difference before long in the Lucene world.


>> Off-line indexing from a flat-file dump?  My guess is that you can dump to
>> disk from the db faster than you can index, and a single dumping thread
>> might be faster than many.
>>
>
> What I envision when I read this is doing a single pass from the database
> into a file, which is then split into a number of pieces, one for each
> shard; each piece then gets imported simultaneously into a build core for
> its shard.  Is that what you were thinking?
>

Pretty much.  If you can stand up a Hadoop cluster (even just a few
machines), then it can manage all of the tasking for this.
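
For the splitting step itself you don't even need Hadoop to start with.  A
short script that hashes the unique key of each record and routes it to a
per-shard file will do.  Here is a rough sketch in Python; the four-shard
count and the assumption that the key is the first column are placeholders
for whatever your setup actually uses:

    #!/usr/bin/env python
    # Split a tab-delimited dump into one file per shard.  Each record is
    # routed by a stable hash of its unique key, so a given document always
    # lands on the same shard.
    import sys
    from zlib import crc32

    NUM_SHARDS = 4   # placeholder: match your real shard count

    outputs = [open('shard-%d.tsv' % i, 'w') for i in range(NUM_SHARDS)]
    for line in sys.stdin:
        key = line.split('\t', 1)[0]                     # assume key is column 1
        shard = crc32(key.encode('utf-8')) % NUM_SHARDS  # stable hash -> shard
        outputs[shard].write(line)
    for out in outputs:
        out.close()

Pipe the dump through something like that, then kick off one import per
shard file against its build core, all in parallel.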


> It looks like there is a way to have MySQL output XML; would that be a
> reasonable way to go about this?  I know a little bit about handling XML in
> Perl, but only by reading the entire file.


Why not tab-delimited data?  Check to see whether MySQL will escape tabs and
newlines in the data correctly for you.  That would be faster to parse than XML.
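
For example, a SELECT ... INTO OUTFILE along these lines writes a
tab-delimited file and escapes tabs, newlines and backslashes that occur in
the data.  Table and column names here are just placeholders, and note that
the file ends up on the database server, not on the client:

    SELECT id, title, body
    INTO OUTFILE '/tmp/docs.tsv'
      FIELDS TERMINATED BY '\t' ESCAPED BY '\\'
      LINES TERMINATED BY '\n'
    FROM documents;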

Solr may handle the XML you produce directly.  I am definitely not an
expert there.
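
If it does, I believe the format its /update handler expects is roughly
this, POSTed with a text/xml content type (the field names are placeholders
that would have to match your schema):

    <add>
      <doc>
        <field name="id">12345</field>
        <field name="title">An example title</field>
      </doc>
    </add>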
