Re: Does docValues impact termfreq ?

Scott Stults Mon, 26 Oct 2015 07:08:09 -0700

Aki, does the sumtotaltermfreq function do what you need?
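
For reference, here is a minimal sketch of how an index-wide count could be requested in one call, instead of summing per-document termfreq values on the client. It assumes Solr's function queries totaltermfreq(field,term) (alias ttf) and sumtotaltermfreq(field) (alias sttf), requested as pseudo-fields in fl. The first counts one term across the index; the second counts all indexed tokens in the field, so totaltermfreq is probably the closer match for a per-term word count. The core URL, the field name "text", and the helper names are hypothetical. Both values are index-wide and are not restricted to the documents matching q.

```python
from urllib.parse import urlencode


def build_count_url(core_url, field, term):
    """Build a /select URL that asks Solr for index-wide term counts."""
    if "'" in term or "\\" in term:
        # Quoting rules for function-query literals are not handled here.
        raise ValueError("term must not contain quotes or backslashes")
    params = {
        "q": "*:*",   # any matching document will do; the values are index-wide
        "rows": 1,    # one document is enough to carry the two pseudo-fields
        "wt": "json",
        "fl": [
            f"ttf:totaltermfreq({field},'{term}')",  # occurrences of one term
            f"sttf:sumtotaltermfreq({field})",       # all tokens in the field
        ],
    }
    return f"{core_url}/select?{urlencode(params, doseq=True)}"


def read_counts(response):
    """Extract (term count, total token count) from a parsed JSON response."""
    doc = response["response"]["docs"][0]
    return doc["ttf"], doc["sttf"]


if __name__ == "__main__":
    # Hypothetical core URL and field name.
    print(build_count_url("http://localhost:8983/solr/corpus", "text", "solr"))
```

The response then carries a single number per function rather than one termfreq per document, which is the "more direct way" asked about further down the thread.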


On Mon, Oct 26, 2015 at 9:43 AM, Aki Balogh <[email protected]> wrote:

> Hi Emir,
>
> This is correct. This is the only way we use the index.
>
> Thanks,
> Aki
>
> On Mon, Oct 26, 2015 at 9:31 AM, Emir Arnautovic <[email protected]> wrote:
>
> > If I got it right, you run a term query, use a function to get TF as the
> > score, iterate over all documents in the results, and sum up the total
> > number of occurrences of a specific term in the index? Is this the only way
> > you use the index, or is it side functionality?
> >
> > Thanks,
> > Emir
> >
> >
> > On 24.10.2015 22:28, Aki Balogh wrote:
> >
> >> Certainly, yes. I'm just doing a word count, i.e. how often does a
> >> specific term come up in the corpus?
> >> On Oct 24, 2015 4:20 PM, "Upayavira" <[email protected]> wrote:
> >>
> >>> yes, but what do you want to do with the TF? What problem are you
> >>> solving with it? If you are able to share that...
> >>>
> >>> On Sat, Oct 24, 2015, at 09:05 PM, Aki Balogh wrote:
> >>>
> >>>> Yes, sorry, I am not being clear.
> >>>>
> >>>> We are not even doing scoring, just getting the raw TF values. We're
> >>>> doing this in solr because it can scale well.
> >>>>
> >>>> But with large corpora, retrieving the word counts takes some time, in
> >>>> part because solr is splitting up word count by document and generating
> >>>> a large response. We then get the response and just sum it all up. I'm
> >>>> wondering if there's a more direct way.
> >>>> On Oct 24, 2015 4:00 PM, "Upayavira" <[email protected]> wrote:
> >>>>
> >>>>> Can you explain more what you are using TF for? Because it sounds
> >>>>> rather like scoring. You could disable field norms and IDF and scoring
> >>>>> would be mostly TF, no?
> >>>>>
> >>>>> Upayavira
> >>>>>
> >>>>> On Sat, Oct 24, 2015, at 07:28 PM, Aki Balogh wrote:
> >>>>>
> >>>>>> Thanks, let me think about that.
> >>>>>>
> >>>>>> We're using termfreq to get the TF score, but we don't know which term
> >>>>>> we'll need the TF for. So we'd have to do a corpuswide summing of
> >>>>>> termfreq for each potential term across all documents in the corpus.
> >>>>>> It seems like it'd require some development work to compute that, and
> >>>>>> our code would be fragile.
> >>>>>>
> >>>>>> Let me think about that more.
> >>>>>>
> >>>>>> It might make sense to just move to solrcloud, it's the right
> >>>>>> architectural decision anyway.
> >>>>>>
> >>>>>>
> >>>>>> On Sat, Oct 24, 2015 at 1:54 PM, Upayavira <[email protected]> wrote:
> >>>>>>
> >>>>>>> If you just want word length, then do work during indexing - index a
> >>>>>>> field for the word length. Then, I believe you can do faceting - e.g.
> >>>>>>> with the json faceting API I believe you can do a sum() calculation on
> >>>>>>> a field rather than the more traditional count.
> >>>>>>>
> >>>>>>> Thinking aloud, there might be an easier way - index a field that is
> >>>>>>> the same for all documents, and facet on it. Instead of counting the
> >>>>>>> number of documents, calculate the sum() of your word count field.
> >>>>>>>
> >>>>>>> I *think* that should work.
> >>>>>>>
> >>>>>>> Upayavira
> >>>>>>>
> >>>>>>> On Sat, Oct 24, 2015, at 04:24 PM, Aki Balogh wrote:
> >>>>>>>
> >>>>>>>> Hi Jack,
> >>>>>>>>
> >>>>>>>> I'm just using solr to get word count across a large number of
> >>>>>>>> documents. It's somewhat non-standard, because we're ignoring
> >>>>>>>> relevance, but it seems to work well for this use case otherwise.
> >>>>>>>>
> >>>>>>>> My understanding then is:
> >>>>>>>> 1) since termfreq is pre-processed and fetched, there's no good way to
> >>>>>>>> speed it up (except by caching earlier calculations)
> >>>>>>>>
> >>>>>>>> 2) there's no way to have solr sum up all of the termfreqs across all
> >>>>>>>> documents in a search and just return one number for total termfreqs
> >>>>>>>>
> >>>>>>>> Are these correct?
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Aki
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On Sat, Oct 24, 2015 at 11:20 AM, Jack Krupansky
> >>>>>>>> <[email protected]>
> >>>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> That's what a normal query does - Lucene takes all the terms used in
> >>>>>>>>> the query and sums them up for each document in the response,
> >>>>>>>>> producing a single number, the score, for each document. That's the
> >>>>>>>>> way Solr is designed to be used. You still haven't elaborated why you
> >>>>>>>>> are trying to use Solr in a way other than it was intended.
> >>>>>>>>>
> >>>>>>>>> -- Jack Krupansky
> >>>>>>>>>
> >>>>>>>>> On Sat, Oct 24, 2015 at 11:13 AM, Aki Balogh <[email protected]>
> >>>>>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Gotcha - that's disheartening.
> >>>>>>>>>>
> >>>>>>>>>> One idea: when I run termfreq, I get all of the termfreqs for each
> >>>>>>>>>> document one-by-one.
> >>>>>>>>>>
> >>>>>>>>>> Is there a way to have solr sum it up before creating the response,
> >>>>>>>>>> so I only receive one number in the response?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Sat, Oct 24, 2015 at 11:05 AM, Upayavira <[email protected]>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> If you mean using the term frequency function query, then I'm not
> >>>>>>>>>>> sure there's a huge amount you can do to improve performance.
> >>>>>>>>>>>
> >>>>>>>>>>> The term frequency is a number that is used often, so it is stored
> >>>>>>>>>>> in the index pre-calculated. Perhaps, if your data is not changing,
> >>>>>>>>>>> optimising your index would reduce it to one segment, and thus
> >>>>>>>>>>> might ever so slightly speed the aggregation of term frequencies,
> >>>>>>>>>>> but I doubt it'd make enough difference to make it worth doing.
> >>>>>>>>>>>
> >>>>>>>>>>> Upayavira
> >>>>>>>>>>>
> >>>>>>>>>>> On Sat, Oct 24, 2015, at 03:37 PM, Aki Balogh wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Thanks, Jack. I did some more research and found similar results.
> >>>>>>>>>>>> In our application, we are making multiple (think: 50) concurrent
> >>>>>>>>>>>> requests to calculate term frequency on a set of documents in
> >>>>>>>>>>>> "real-time". The faster that results return, the better.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Most of these requests are unique, so cache only helps slightly.
> >>>>>>>>>>>> This analysis is happening on a single solr instance.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Other than moving to solr cloud and splitting out the processing
> >>>>>>>>>>>> onto multiple servers, do you have any suggestions for what might
> >>>>>>>>>>>> speed up termfreq at query time?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks,
> >>>>>>>>>>>> Aki
> >>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Fri, Oct 23, 2015 at 7:21 PM, Jack Krupansky
> >>>>>>>>>>>> <[email protected]>
> >>>>>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Term frequency applies only to the indexed terms of a tokenized
> >>>>>>>>>>>>> field. DocValues is really just a copy of the original source text
> >>>>>>>>>>>>> and is not tokenized into terms.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Maybe you could explain how exactly you are using term frequency in
> >>>>>>>>>>>>> function queries. More importantly, what is so "heavy" about your
> >>>>>>>>>>>>> usage? Generally, moderate use of a feature is much more advisable
> >>>>>>>>>>>>> than heavy usage, unless you don't care about performance.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -- Jack Krupansky
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On Fri, Oct 23, 2015 at 8:19 AM, Aki Balogh <[email protected]>
> >>>>>>>>>>>>> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> Hello,
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> In our solr application, we use a Function Query (termfreq) very
> >>>>>>>>>>>>>> heavily.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Index time and disk space are not important, but we're looking to
> >>>>>>>>>>>>>> improve performance on termfreq at query time.
> >>>>>>>>>>>>>> I've been reading up on docValues. Would this be a way to improve
> >>>>>>>>>>>>>> performance?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I had read that Lucene uses Field Cache for Function Queries, so
> >>>>>>>>>>>>>> performance may not be affected.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> And, any general suggestions for improving query performance on
> >>>>>>>>>>>>>> Function Queries?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Thanks,
> >>>>>>>>>>>>>> Aki
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> > --
> > Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> > Solr & Elasticsearch Support * http://sematext.com/
> >
> >
> >
>



-- 
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com
