Hello-

I am running into some scaling performance problems with SQL that I hope a clever solr solution could fix. I've already gone through a bunch of loops, so I figure I should solicit advice before continuing to chase my tail.

I have a bunch of things (100K-500K+) that are each defined by a set of user tags. ryan says: (name=xxx, location=yyy, foo=[aaa,bbb,ccc]), and alison says: (name=zzz, location=bbb). This list is constantly updating; it is fed by automated crawlers and user generated content. The 'names' can be arbitrary, but 99% of them will fall into ~25 distinct names.

My approach has been to build a repository of all the 'tags' and then, as entries come into that repository, merge all the tags for each entry into a single 'flat' document and index it with solr.
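
For what it's worth, the flattening step looks roughly like this - just a SolrJ sketch, where 'flatten' and the Map-based tag sets are my own illustration, not real code from the app:

  import java.util.List;
  import java.util.Map;
  import org.apache.solr.common.SolrInputDocument;

  // Flatten every source's name/value pairs for one entry into a single
  // solr document.  Repeated names (like 'name' from ryan and alison)
  // simply become multi-valued fields.
  public static SolrInputDocument flatten( String entryID,
                                           List<Map<String,Object>> tagSets ) {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField( "id", entryID );
    for( Map<String,Object> tags : tagSets ) {
      for( Map.Entry<String,Object> tag : tags.entrySet() ) {
        doc.addField( tag.getKey(), tag.getValue() );
      }
    }
    return doc;
  }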

When my thing+tag count was small, a simple SQL table with a row for each tag worked great:

CREATE TABLE `my_tags` (
  entryID varchar(40) NOT NULL,
  source varchar(40) NOT NULL,
  name varchar(40) NOT NULL,
  value TEXT NOT NULL,
  KEY( entryID ),
  KEY( source )
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

but as the row count gets big (2M+), this becomes unusable. To make it tractable, I am now splitting the tags across a bunch of tables and pushing the per-user name/value pairs into a single text field (stored as JSON):

CREATE TABLE `my_tags_000` (
  entryID varchar(40) NOT NULL,
  source varchar(40) NOT NULL,
  tags LONGTEXT NOT NULL,
  PRIMARY KEY( entryID, source )
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
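
Building that JSON blob is nothing fancy - roughly this, using the json.org classes (any JSON library would do the same), with the example values from above:

  import org.json.JSONArray;
  import org.json.JSONException;
  import org.json.JSONObject;

  // Pack one source's name/value pairs into the LONGTEXT 'tags' column.
  public static String encodeTags() throws JSONException {
    JSONObject tags = new JSONObject();
    tags.put( "name", "xxx" );
    tags.put( "location", "yyy" );
    tags.put( "foo", new JSONArray().put( "aaa" ).put( "bbb" ).put( "ccc" ) );
    return tags.toString();  // {"name":"xxx","location":"yyy","foo":["aaa","bbb","ccc"]}
  }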

Then I pick which table each entry goes into using:
 Math.abs( id.hashCode() )%10
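
One small caveat with that: Math.abs(Integer.MIN_VALUE) is still negative in java, so masking the sign bit is a touch safer - something like:

  // Route an entry to one of the 10 shard tables by hashing its id.
  // Masking the sign bit avoids the Math.abs(Integer.MIN_VALUE) edge case.
  public static String tableFor( String id ) {
    int shard = ( id.hashCode() & 0x7fffffff ) % 10;
    return String.format( "my_tags_%03d", shard );
  }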

This works OK, but it is still slower than I would like. DB access is slow, and it also needs to search across the constantly updating solr index, and that gets slow since it keeps reopening the searcher (autowarming is off!).

Soooo... I see a few paths and would love external feedback before banging my head against this any longer.

1. Get help from someone who knows more SQL than I do and try to make a pure SQL approach work. This would need to handle 10M+ tags. Solr indexing is then a direct SQL -> solr dump (roughly the sketch after this list).

2. Figure out how to keep the base tuple store in solr. I think this will require finishing up SOLR-139. This would keep the core data in solr - so there is no good way to 'rebuild' the index.

3. Something else? Store input on disk?
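
For reference, the "direct SQL -> solr dump" in option 1 would be roughly this - a sketch only, assuming the single my_tags table from above, the SolrJ client, and a made-up batch size:

  import java.sql.*;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  // Stream every tag row out of MySQL, flatten rows sharing an entryID
  // into one document, and push them to solr in batches with a single
  // commit at the end (so the searcher only reopens once).
  public static void dump( Connection conn, String solrUrl ) throws Exception {
    SolrServer solr = new CommonsHttpSolrServer( solrUrl );
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery(
        "SELECT entryID, name, value FROM my_tags ORDER BY entryID" );

    List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>();
    SolrInputDocument doc = null;
    String lastID = null;
    while( rs.next() ) {
      String id = rs.getString( "entryID" );
      if( !id.equals( lastID ) ) {     // new entry -> start a new document
        if( doc != null ) batch.add( doc );
        doc = new SolrInputDocument();
        doc.addField( "id", id );
        lastID = id;
      }
      doc.addField( rs.getString( "name" ), rs.getString( "value" ) );
      if( batch.size() >= 1000 ) {     // keep memory bounded
        solr.add( batch );
        batch.clear();
      }
    }
    if( doc != null ) batch.add( doc );
    if( !batch.isEmpty() ) solr.add( batch );
    solr.commit();                     // one searcher reopen, at the end
    rs.close();
    stmt.close();
  }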


Any thoughts / pointers / nay-saying would be really helpful!

thanks
ryan