For an index that size, you have a lot of options. I'd completely
ignore any discussion that starts with "but our index will be bigger
if we do that" until it's proven to be a problem. For reference, I
commonly see 200G-300G indexes so....

Ok, to your problem.
Your update rate is very low, so don't worry about it. In this case I'd
set the autocommit interval to as long as you can tolerate (say 15
seconds? 5 seconds?). If you can batch up your updates it'll help
(e.g. let's say you update your Solr index once a minute: collect all
of the records that have changed in the last minute, batch them up into
a single request and send it).
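
For example, a single batched request in Solr's XML update format could
look like the following (the core name "users" and these field names
and values are just placeholders for illustration):

    POST http://localhost:8983/solr/users/update

    <add>
      <doc>
        <field name="id">1001</field>
        <field name="username">cschultz</field>
        <field name="email">chris@example.com</field>
      </doc>
      <doc>
        <field name="id">1002</field>
        <field name="username">jsmith</field>
        <field name="email">jsmith@example.com</field>
      </doc>
    </add>

One request like that per minute is much easier on Solr than a separate
request for each changed record.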

If your update pattern _is_ something like the above, it really doesn't
matter what your autocommit interval is, since it'll only be triggered
once a minute in my example. At this size/rate I wouldn't worry about
soft commits at all; just leave them out or set the interval to -1
(never fires).
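
Concretely, that combination would look something like this in
solrconfig.xml (15000 ms just stands in for "as long as you can
tolerate"; openSearcher is set to true so the hard commit also makes
changes visible, since there's no soft commit to do it):

    <autoCommit>
      <maxTime>15000</maxTime>
      <openSearcher>true</openSearcher>
    </autoCommit>

    <autoSoftCommit>
      <maxTime>-1</maxTime>
    </autoSoftCommit>

Adjust maxTime to whatever visibility lag you can live with.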

As for your use-cases, pre- and post-fix wildcards are tricky. In the
naive case, where you just index the terms regularly, they're quite
expensive, since to find the matching terms you must enumerate all
terms in a field. However, at this size this is the first thing I'd
try; it might be fast enough. If it's not, the trick is to use ngrams
(say bigrams). So if I'm indexing "erick", it becomes "er" "ri" "ic"
"ck". Now a search for *ric* becomes simpler, as it's a phrase search
for "ri" followed by "ic". Again, at your size the index increase is
not a problem, I'd guess.

So StandardTokenizer + LowerCaseFilter + NGramFilter is where I'd
start. You'll find the admin/analysis page _extremely_ valuable for
understanding how these interact.
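
As a starting point, a field type along those lines might look like
this in the schema (the type and field names and the 2/2 gram sizes are
assumptions to match the bigram example above, not a recommendation):

    <fieldType name="text_bigram" class="solr.TextField"
               positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.NGramFilterFactory" minGramSize="2"
                maxGramSize="2"/>
      </analyzer>
    </fieldType>

    <field name="username_ngram" type="text_bigram" indexed="true"
           stored="false"/>

Whether you also want the NGramFilter on the query side, or a larger
maxGramSize, depends on how you issue the substring queries, and the
analysis page will show you exactly what tokens each choice produces.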

Do be careful to try edge cases, particularly ones involving
punctuation. You'll discover, for instance, that switching to something
like WhitespaceTokenizer suddenly stops removing punctuation.
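
For example (roughly, and worth confirming on the admin/analysis page;
the address is made up):

    input:                  chris@example.com
    StandardTokenizer  ->   "chris", "example.com"   (the '@' is eaten)
    WhitespaceTokenizer ->  "chris@example.com"      (punctuation kept)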

Best,
Erick

On Fri, Mar 16, 2018 at 6:46 AM, Christopher Schultz
<[email protected]> wrote:
> All,
>
> I'm using Solr to index and search a database of user data (username,
> email, first and last name), so there aren't really "terms" in the data
> to search for, like you might search for words that describe products in
> a catalog, for example.
>
> I have set up my schema to include plain-old text fields for each of the
> data mentioned above, plus I have a copy-field called "all" which
> includes everything all together, plus I have a first + last field which
> uses a phonetic index and query analyzer.
>
> Since I don't need things such as term-replacement (spanner == wrench),
> stemming (first name 'chris' -> 'chri'), and possibly other features
> that I don't know about, I'm wondering what might be a recommended set
> of tokenizer(s), analyzer(s), etc. for such data.
>
> We will definitely want to be able to search by substring (to find
> 'cschultz' as a username with 'schultz' as input) but some substrings
> are probably useless (such as @gmail.com for email addresses) and don't
> need to be supported.
>
> What are some good options to look at for this type of data?
>
> In production, we have fewer than 5M records to handle, so this is more
> of an academic exercise than an actual performance requirement (since
> Solr is at least an order of magnitude faster than our current
> RDBMS-searching implementation).
>
> If it makes any difference, we are trying to keep the index up-to-date
> with all user changes made in real time (okay, maybe delayed by a few
> seconds, but basically realtime). We have a few hundred new-user
> registrations per day and probably half as many changes to user records
> as that, so perhaps 2 document-updates per minute on average (during ~12
> business hours in the US on weekdays).
>
> Thanks for any advice anyone may have,
> -chris
>
