Hello-
I am running into some scaling performance problems with SQL that I hope
a clever solr solution could fix. I've already gone through a bunch of
loops, so I figure I should solicit advice before continuing to chase my
tail.
I have a bunch of things (100K-500K+) that are defined by a set of user
tags. ryan says: (name=xxx, location=yyy, foo=[aaa,bbb,ccc]), and
alison says (name=zzz, location=bbb). This list is constantly updating;
it is fed from automated crawlers and user-generated content. The
'names' can be arbitrary, but 99% of them will fall within ~25 distinct
names.
My approach has been to build a repository of all the 'tags' and then as
things come into that repository, I merge all the tags for that entry
into a single 'flat' document and index it with solr.
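(For reference, the merge step looks roughly like this SolrJ sketch -
the field layout is illustrative, not a final schema:)

  import java.util.Collection;
  import java.util.Map;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.common.SolrInputDocument;

  // Flatten the merged tags for one entry into a single solr document.
  // "tags" maps each tag name to every value any source supplied for it.
  public static void indexEntry(SolrServer solr, String entryID,
      Map<String, Collection<Object>> tags) throws Exception {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", entryID);
    for (Map.Entry<String, Collection<Object>> e : tags.entrySet()) {
      for (Object value : e.getValue()) {
        doc.addField(e.getKey(), value);  // multi-valued field per tag name
      }
    }
    solr.add(doc);  // commits happen separately, in batches
  }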
When my thing+tag count was small, a simple SQL table with a row for
each tag worked great:
CREATE TABLE `my_tags` (
  entryID varchar(40) NOT NULL,
  source  varchar(40) NOT NULL,
  name    varchar(40) NOT NULL,
  value   TEXT NOT NULL,
  KEY( entryID ),
  KEY( source )
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
but as the row count gets big (2M+) this becomes unusable. To make it
tractable, I am now splitting the tags across a bunch of tables and
pushing the per-user name/value pairs into a single text field (stored
as JSON):
CREATE TABLE `my_tags_000` (
  entryID varchar(40) NOT NULL,
  source  varchar(40) NOT NULL,
  tags    LONGTEXT NOT NULL,
  PRIMARY KEY( entryID, source )
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Then I pick which table an entry goes into using:
Math.abs( id.hashCode() )%10
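(Spelled out, and with a guard, since Math.abs(Integer.MIN_VALUE) is
still negative so the naive form can produce a negative index for one
unlucky hash - the helper name is just for illustration:)

  // Pick one of the 10 shard tables for an entry id. Mask off the sign
  // bit instead of calling Math.abs(), which stays negative for
  // Integer.MIN_VALUE.
  public static String tableFor(String id) {
    int bucket = (id.hashCode() & 0x7fffffff) % 10;
    return String.format("my_tags_%03d", bucket);
  }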
This works OK, but it is still slower than I would like. DB access is
slow, and I also need to search across the constantly updating solr
index; that gets slow since it keeps reopening the searcher
(autowarming is off!).
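(Each commit() is what forces the searcher reopen, so about the only
lever on that side is to batch adds and commit less often - a sketch,
where the batch size of 1000 is just a number to tune:)

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.common.SolrInputDocument;

  // Batch the incoming docs and commit every N adds instead of per
  // document; each commit() closes and reopens the searcher.
  public static void indexAll(SolrServer solr,
      Iterable<SolrInputDocument> docs) throws Exception {
    int pending = 0;
    for (SolrInputDocument doc : docs) {
      solr.add(doc);
      if (++pending >= 1000) {
        solr.commit();
        pending = 0;
      }
    }
    if (pending > 0) solr.commit();
  }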
Soooo... I see a few paths and would love external feedback before
banging my head against this any longer.
1. Get help from someone who knows more SQL than me and try to make a
pure SQL approach work. This would need to work with 10M+ tags. Solr
indexing is then a direct SQL -> solr dump (see the sketch after this
list).
2. Figure out how to keep the base tuple store in solr. I think this
will require finishing up SOLR-139. This would keep the core data in
solr - so there is no good way to 'rebuild' the index.
3. Something else? Store the input on disk?
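For (1), the indexing side at least seems straightforward - something
like this dump loop against the my_tags schema above, streaming rows
grouped by entryID (a sketch, not tested):

  import java.sql.Connection;
  import java.sql.PreparedStatement;
  import java.sql.ResultSet;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.common.SolrInputDocument;

  // Walk the tag table ordered by entryID, flatten each entry's rows
  // into one document, and commit once at the end (a single searcher
  // reopen for the whole dump).
  public static void dump(Connection db, SolrServer solr) throws Exception {
    PreparedStatement ps = db.prepareStatement(
        "SELECT entryID, name, value FROM my_tags ORDER BY entryID");
    ResultSet rs = ps.executeQuery();
    SolrInputDocument doc = null;
    String current = null;
    while (rs.next()) {
      String id = rs.getString("entryID");
      if (!id.equals(current)) {   // starting a new entry
        if (doc != null) solr.add(doc);
        doc = new SolrInputDocument();
        doc.addField("id", id);
        current = id;
      }
      doc.addField(rs.getString("name"), rs.getString("value"));
    }
    if (doc != null) solr.add(doc);
    rs.close();
    ps.close();
    solr.commit();
  }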
Any thoughts / pointers / nay-saying would be really helpful!
thanks
ryan