On Mon, Dec 5, 2011 at 3:28 PM, Shawn Heisey <s...@elyograg.org> wrote:
> On 12/4/2011 12:41 AM, Ted Dunning wrote:
>
>> Read the papers I referred to.  They describe how to search a fairly
>> enormous corpus with an 8GB in-memory index (and no disk cache at
>> all).
>
> They would seem to indicate moving away from Solr.  While that would
> not be entirely out of the question, I don't relish coming up with a
> whole new system from scratch, one part of which will mean rewriting
> the build system a third time.

Yeah.  That wouldn't be good.

But there are lots of interesting developments happening in new index
formats in Lucene.  Flexible indexing is very nice.  It may not help you
immediately, but I think that techniques like this are going to make a
huge difference before long in the Lucene world.

>> Off-line indexing from a flat-file dump?  My guess is that you can
>> dump to disk from the db faster than you can index, and a single
>> dumping thread might be faster than many.
>
> What I envision when I read this is doing a single pass from the
> database into a file, which is then split into a number of pieces, one
> for each shard, and then that gets imported simultaneously into a
> build core for each shard.  Is that what you were thinking?

Pretty much.  If you can stand up a Hadoop cluster (even just a few
machines), then it can manage all of the tasking for this.

> It looks like there is a way to have MySQL output XML; would that be a
> reasonable way to go about this?  I know a little bit about handling
> XML in Perl, but only by reading the entire file.

Why not tab-delimited data?  Check to see if MySQL will escape things
correctly for you.  That would be faster to parse than XML.

Solr may handle the XML you produce directly.  I am definitely not an
expert there.
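
For the splitting step, here is a rough sketch of the kind of thing I
mean (Python, untested; the shard count, file names, and the assumption
that the unique key is the first column of a tab-delimited dump are all
placeholders for whatever your schema actually uses):

    import zlib

    NUM_SHARDS = 6                  # however many build cores you run
    dump = open("dump.tsv")         # single-pass dump from the database
    outputs = [open("shard-%d.tsv" % i, "w") for i in range(NUM_SHARDS)]

    for line in dump:
        # Hash the unique key (first column) so a given document always
        # lands in the same shard.
        key = line.split("\t", 1)[0]
        shard = zlib.crc32(key.encode("utf-8")) % NUM_SHARDS
        outputs[shard].write(line)

    for f in [dump] + outputs:
        f.close()

I believe you can produce the dump file in one pass with mysql --batch
(which emits tab-separated rows) or SELECT ... INTO OUTFILE, and then
run one import per shard-N.tsv in parallel.  Do check how tabs and
newlines inside your text fields get escaped before trusting it.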