Re: Does docValues impact termfreq ?

Erick Erickson Mon, 26 Oct 2015 20:55:30 -0700

Do be aware that docValues can only be used for non-text types,
i.e. numerics, strings and the like. Specifically, docValues are
_not_ possible for solr.TextField, and docValues don't support
analysis chains because the underlying primitive types don't. You'll
get an error if you try to specify docValues on a solr.TextField
type.
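
To make the distinction concrete, here is a toy Python sketch. It is not
Solr or Lucene code: the corpus, the field names, the whitespace tokenizer
and the helper build_postings are all made up for illustration. A tokenized
text field feeds an inverted index that records a per-document term
frequency, which is the kind of number termfreq() reads. A docValues-style
column holds one un-analyzed value per document, so it has no per-term
counts to offer.

```python
from collections import Counter, defaultdict

# Hypothetical two-document corpus (field names are illustrative only).
docs = {
    1: {"body": "solr scales well and solr is fast", "category": "search"},
    2: {"body": "lucene powers solr", "category": "library"},
}

def build_postings(docs, field):
    """Toy inverted index for one field: term -> {doc_id: term frequency}."""
    postings = defaultdict(dict)
    for doc_id, doc in docs.items():
        # Stand-in for an analysis chain: lowercase, then split on whitespace.
        tokens = doc[field].lower().split()
        for term, tf in Counter(tokens).items():
            postings[term][doc_id] = tf
    return postings

postings = build_postings(docs, "body")

# Per-document term frequency for one term in the tokenized field.
per_doc_tf = postings["solr"]

# A corpus-wide word count is the sum of the per-document frequencies.
total_tf = sum(per_doc_tf.values())

# Toy docValues-style column: one whole, un-tokenized value per document.
category_column = {doc_id: doc["category"] for doc_id, doc in docs.items()}
```

The point of the sketch is only that the counts live in the postings built
from analyzed text, not in the per-document column.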


Does that change the discussion?

Best,
Erick
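
For the corpus-wide count asked about in the quoted thread, Solr's function
queries also include totaltermfreq (alias ttf), which returns how often a
term occurs in a field across the index. Below is a hedged Python sketch
that only builds the request URL and sends nothing. The host, the core name
"corpus", the field name "body" and the helper ttf_url are placeholders, not
anything from this thread, and on a sharded collection the value may reflect
a single shard, so verify it against your own setup.

```python
from urllib.parse import urlencode

def ttf_url(base_url, field, term):
    """Build a /select URL that asks Solr for ttf(field, term).

    Assumes the term contains no quote characters (no escaping is done).
    """
    params = {
        "q": "*:*",
        "rows": 1,  # the value is index-wide, so one row is enough
        "fl": f"total:ttf({field},'{term}')",  # "total" is just a field alias
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"

# Hypothetical core and field names.
url = ttf_url("http://localhost:8983/solr/corpus", "body", "solr")
```

The single number then appears as the "total" pseudo-field of the one
returned document, instead of one termfreq value per document.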

On Mon, Oct 26, 2015 at 7:36 AM, Emir Arnautovic
<emir.arnauto...@sematext.com> wrote:
> Hi Aki,
> IMO this is underuse of Solr (not to mention SolrCloud). I would recommend
> doing in-memory document parsing (if you need something from the Lucene/Solr
> analysis classes, use it) and using some other cache-like solution to store
> term/total frequency pairs (you can try Redis).
>
> That way you will have updatable, fast total frequency lookups.
>
> Thanks,
> Emir
>
> On 26.10.2015 14:43, Aki Balogh wrote:
>>
>> Hi Emir,
>>
>> This is correct. This is the only way we use the index.
>>
>> Thanks,
>> Aki
>>
>> On Mon, Oct 26, 2015 at 9:31 AM, Emir Arnautovic <
>> emir.arnauto...@sematext.com> wrote:
>>
>>> If I got it right, you are using a term query, using a function to get TF
>>> as the score, iterating over all documents in the results and summing up
>>> the total number of occurrences of a specific term in the index? Is this
>>> the only way you use the index, or is this side functionality?
>>>
>>> Thanks,
>>> Emir
>>>
>>>
>>> On 24.10.2015 22:28, Aki Balogh wrote:
>>>
>>>> Certainly, yes. I'm just doing a word count, i.e. how often does a
>>>> specific term come up in the corpus?
>>>>
>>>> On Oct 24, 2015 4:20 PM, "Upayavira" <u...@odoko.co.uk> wrote:
>>>>
>>>>> Yes, but what do you want to do with the TF? What problem are you
>>>>> solving with it? If you are able to share that...
>>>>>
>>>>> On Sat, Oct 24, 2015, at 09:05 PM, Aki Balogh wrote:
>>>>>
>>>>>> Yes, sorry, I am not being clear.
>>>>>>
>>>>>> We are not even doing scoring, just getting the raw TF values. We're
>>>>>> doing this in solr because it can scale well.
>>>>>>
>>>>>> But with large corpora, retrieving the word counts takes some time, in
>>>>>> part because solr is splitting up word count by document and generating
>>>>>> a large request. We then get the request and just sum it all up. I'm
>>>>>> wondering if there's a more direct way.
>>>>>> On Oct 24, 2015 4:00 PM, "Upayavira" <u...@odoko.co.uk> wrote:
>>>>>>
>>>>>>> Can you explain more what you are using TF for? Because it sounds
>>>>>>> rather like scoring. You could disable field norms and IDF and scoring
>>>>>>> would be mostly TF, no?
>>>>>>>
>>>>>>> Upayavira
>>>>>>>
>>>>>>> On Sat, Oct 24, 2015, at 07:28 PM, Aki Balogh wrote:
>>>>>>>
>>>>>>>> Thanks, let me think about that.
>>>>>>>>
>>>>>>>> We're using termfreq to get the TF score, but we don't know which
>>>>>>>> term we'll need the TF for. So we'd have to do a corpuswide summing of
>>>>>>>> termfreq for each potential term across all documents in the corpus.
>>>>>>>> It seems like it'd require some development work to compute that, and
>>>>>>>> our code would be fragile.
>>>>>>>>
>>>>>>>> Let me think about that more.
>>>>>>>>
>>>>>>>> It might make sense to just move to solrcloud, it's the right
>>>>>>>> architectural decision anyway.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Oct 24, 2015 at 1:54 PM, Upayavira <u...@odoko.co.uk> wrote:
>>>>>>>>
>>>>>>>>> If you just want word length, then do work during indexing - index a
>>>>>>>>> field for the word length. Then, I believe you can do faceting - e.g.
>>>>>>>>> with the json faceting API I believe you can do a sum() calculation
>>>>>>>>> on a field rather than the more traditional count.
>>>>>>>>>
>>>>>>>>> Thinking aloud, there might be an easier way - index a field that is
>>>>>>>>> the same for all documents, and facet on it. Instead of counting the
>>>>>>>>> number of documents, calculate the sum() of your word count field.
>>>>>>>>>
>>>>>>>>> I *think* that should work.
>>>>>>>>>
>>>>>>>>> Upayavira
>>>>>>>>>
>>>>>>>>> On Sat, Oct 24, 2015, at 04:24 PM, Aki Balogh wrote:
>>>>>>>>>
>>>>>>>>>> Hi Jack,
>>>>>>>>>>
>>>>>>>>>> I'm just using solr to get word count across a large number of
>>>>>>>>>> documents.
>>>>>>>>>>
>>>>>>>>>> It's somewhat non-standard, because we're ignoring relevance, but it
>>>>>>>>>> seems to work well for this use case otherwise.
>>>>>>>>>>
>>>>>>>>>> My understanding then is:
>>>>>>>>>> 1) since termfreq is pre-processed and fetched, there's no good way
>>>>>>>>>> to speed it up (except by caching earlier calculations)
>>>>>>>>>>
>>>>>>>>>> 2) there's no way to have solr sum up all of the termfreqs across all
>>>>>>>>>> documents in a search and just return one number for total termfreqs
>>>>>>>>>>
>>>>>>>>>> Are these correct?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Aki
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, Oct 24, 2015 at 11:20 AM, Jack Krupansky
>>>>>>>>>> <jack.krupan...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> That's what a normal query does - Lucene takes all the terms used in
>>>>>>>>>>> the query and sums them up for each document in the response,
>>>>>>>>>>> producing a single number, the score, for each document. That's the
>>>>>>>>>>> way Solr is designed to be used. You still haven't elaborated why you
>>>>>>>>>>> are trying to use Solr in a way other than it was intended.
>>>>>>>>>>>
>>>>>>>>>>> -- Jack Krupansky
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Oct 24, 2015 at 11:13 AM, Aki Balogh <a...@marketmuse.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Gotcha - that's disheartening.
>>>>>>>>>>>>
>>>>>>>>>>>> One idea: when I run termfreq, I get all of the termfreqs for each
>>>>>>>>>>>> document one-by-one.
>>>>>>>>>>>>
>>>>>>>>>>>> Is there a way to have solr sum it up before creating the request,
>>>>>>>>>>>> so I only receive one number in the response?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Oct 24, 2015 at 11:05 AM, Upayavira <u...@odoko.co.uk>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> If you mean using the term frequency function query, then I'm not
>>>>>>>>>>>>> sure there's a huge amount you can do to improve performance.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The term frequency is a number that is used often, so it is stored
>>>>>>>>>>>>> in the index pre-calculated. Perhaps, if your data is not changing,
>>>>>>>>>>>>> optimising your index would reduce it to one segment, and thus
>>>>>>>>>>>>> might ever so slightly speed the aggregation of term frequencies,
>>>>>>>>>>>>> but I doubt it'd make enough difference to make it worth doing.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Upayavira
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Oct 24, 2015, at 03:37 PM, Aki Balogh wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks, Jack. I did some more research and found similar results.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In our application, we are making multiple (think: 50) concurrent
>>>>>>>>>>>>>> requests to calculate term frequency on a set of documents in
>>>>>>>>>>>>>> "real-time". The faster that results return, the better.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Most of these requests are unique, so cache only helps slightly.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This analysis is happening on a single solr instance.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Other than moving to solr cloud and splitting out the processing
>>>>>>>>>>>>>> onto multiple servers, do you have any suggestions for what might
>>>>>>>>>>>>>> speed up termfreq at query time?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Aki
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Oct 23, 2015 at 7:21 PM, Jack Krupansky
>>>>>>>>>>>>>> <jack.krupan...@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Term frequency applies only to the indexed terms of a tokenized
>>>>>>>>>>>>>>> field. DocValues is really just a copy of the original source
>>>>>>>>>>>>>>> text and is not tokenized into terms.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Maybe you could explain how exactly you are using term frequency
>>>>>>>>>>>>>>> in function queries. More importantly, what is so "heavy" about
>>>>>>>>>>>>>>> your usage? Generally, moderate use of a feature is much more
>>>>>>>>>>>>>>> advisable to heavy usage, unless you don't care about performance.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -- Jack Krupansky
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Oct 23, 2015 at 8:19 AM, Aki Balogh <a...@marketmuse.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In our solr application, we use a Function Query (termfreq) very
>>>>>>>>>>>>>>>> heavily.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Index time and disk space are not important, but we're looking to
>>>>>>>>>>>>>>>> improve performance on termfreq at query time.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I've been reading up on docValues. Would this be a way to improve
>>>>>>>>>>>>>>>> performance?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I had read that Lucene uses Field Cache for Function Queries, so
>>>>>>>>>>>>>>>> performance may not be affected.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> And, any general suggestions for improving query performance on
>>>>>>>>>>>>>>>> Function Queries?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Aki
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>> --
>>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>>> Solr & Elasticsearch Support * http://sematext.com/
>>>
>>>
>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/
>
