Hello-
I am running into some scaling performance problems with SQL that I hope
a clever solr solution could fix. I've already gone through a bunch of
loops, so I figure I should solicit advice before continuing to chase my
tail.
I have a bunch of things (100K-500K+) that are defined by a set of user
tags. ryan says: (name=xxx, location=yyy, foo=[aaa,bbb,ccc]), and
alison says (name=zzz, location=bbb). This list is constantly updating;
it is fed from automated crawlers and user-generated content. The
'names' can be arbitrary, but 99% of them will fall within ~25 distinct
names.
My approach has been to build a repository of all the 'tags' and then as
things come into that repository, I merge all the tags for that entry
into a single 'flat' document and index it with solr.
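(For reference, the merge step looks roughly like this SolrJ sketch -
the field layout is illustrative, not a final schema:)

  import java.util.Collection;
  import java.util.Map;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.common.SolrInputDocument;

  // Flatten the merged tags for one entry into a single solr document.
  // "tags" maps each tag name to every value any source supplied for it.
  public static void indexEntry(SolrServer solr, String entryID,
      Map<String, Collection<Object>> tags) throws Exception {
    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", entryID);
    for (Map.Entry<String, Collection<Object>> e : tags.entrySet()) {
      for (Object value : e.getValue()) {
        doc.addField(e.getKey(), value);  // multi-valued field per tag name
      }
    }
    solr.add(doc);  // commits happen separately, in batches
  }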
When my thing+tag count was small, a simple SQL table with a row for
each tag worked great:
CREATE TABLE `my_tags` (
  entryID varchar(40) NOT NULL,
  source  varchar(40) NOT NULL,
  name    varchar(40) NOT NULL,
  value   TEXT NOT NULL,
  KEY( entryID ),
  KEY( source )
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
but as the row count gets big (2M+) this becomes unusable. To make it
tractable, I am now splitting the tags across a bunch of tables and
pushing the per-user name/value pairs into a single text field (stored
as JSON):
CREATE TABLE `my_tags_000` (
  entryID varchar(40) NOT NULL,
  source  varchar(40) NOT NULL,
  tags    LONGTEXT NOT NULL,
  PRIMARY KEY( entryID, source )
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Then I pick which table an entry goes into using:
Math.abs( id.hashCode() )%10
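(Spelled out, and with a guard, since Math.abs(Integer.MIN_VALUE) is
still negative so the naive form can produce a negative index for one
unlucky hash - the helper name is just for illustration:)

  // Pick one of the 10 shard tables for an entry id. Mask off the sign
  // bit instead of calling Math.abs(), which stays negative for
  // Integer.MIN_VALUE.
  public static String tableFor(String id) {
    int bucket = (id.hashCode() & 0x7fffffff) % 10;
    return String.format("my_tags_%03d", bucket);
  }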
This works OK, but it is still slower than I would like. DB access is
slow, and I also need to search across the constantly updating solr
index; that gets slow since it keeps reopening the searcher
(autowarming is off!).
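(Each commit() is what forces the searcher reopen, so about the only
lever on that side is to batch adds and commit less often - a sketch,
where the batch size of 1000 is just a number to tune:)

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.common.SolrInputDocument;

  // Batch the incoming docs and commit every N adds instead of per
  // document; each commit() closes and reopens the searcher.
  public static void indexAll(SolrServer solr,
      Iterable<SolrInputDocument> docs) throws Exception {
    int pending = 0;
    for (SolrInputDocument doc : docs) {
      solr.add(doc);
      if (++pending >= 1000) {
        solr.commit();
        pending = 0;
      }
    }
    if (pending > 0) solr.commit();
  }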
Soooo... I see a few paths and would love external feedback before
banging my head against this any longer.
1. Get help from someone who knows more SQL than me and try to make a
pure SQL approach work. This would need to work with 10M+ tags. Solr
indexing is then a direct SQL -> solr dump (see the sketch after this
list).
2. Figure out how to keep the base tuple store in solr. I think this
will require finishing up SOLR-139. This would keep the core data in
solr - so there is no good way to 'rebuild' the index.
3. Something else? Store the input on disk?
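For (1), the indexing side at least seems straightforward - something
like this dump loop against the my_tags schema above, streaming rows
grouped by entryID (a sketch, not tested):

  import java.sql.Connection;
  import java.sql.PreparedStatement;
  import java.sql.ResultSet;
  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.common.SolrInputDocument;

  // Walk the tag table ordered by entryID, flatten each entry's rows
  // into one document, and commit once at the end (a single searcher
  // reopen for the whole dump).
  public static void dump(Connection db, SolrServer solr) throws Exception {
    PreparedStatement ps = db.prepareStatement(
        "SELECT entryID, name, value FROM my_tags ORDER BY entryID");
    ResultSet rs = ps.executeQuery();
    SolrInputDocument doc = null;
    String current = null;
    while (rs.next()) {
      String id = rs.getString("entryID");
      if (!id.equals(current)) {   // starting a new entry
        if (doc != null) solr.add(doc);
        doc = new SolrInputDocument();
        doc.addField("id", id);
        current = id;
      }
      doc.addField(rs.getString("name"), rs.getString("value"));
    }
    if (doc != null) solr.add(doc);
    rs.close();
    ps.close();
    solr.commit();
  }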
Any thoughts / pointers / nay-saying would be really helpful!
thanks
ryan