Hi Emir,

This is correct. This is the only way we use the index.

Thanks,
Aki
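The workflow being confirmed here - run a term query, pull termfreq() back
for each matching document, and sum the values client-side - would look
roughly like the sketch below. The Solr URL, collection name ("articles"),
field name ("body"), and term are placeholders, not details from the thread.

# Fetch termfreq() as a pseudo-field for every matching document and sum
# the per-document values on the client, as described in the thread.
# URL, collection ("articles") and field ("body") are assumptions.
import requests

SOLR = "http://localhost:8983/solr/articles/select"

def corpus_term_count(term):
    params = {
        "q": 'body:"%s"' % term,                    # term query
        "fl": "id,tf:termfreq(body,'%s')" % term,   # per-document TF
        "rows": 100000,                             # sized to the corpus
        "wt": "json",
    }
    docs = requests.get(SOLR, params=params).json()["response"]["docs"]
    return sum(doc.get("tf", 0) for doc in docs)

print(corpus_term_count("faceting"))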
On Mon, Oct 26, 2015 at 9:31 AM, Emir Arnautovic <emir.arnauto...@sematext.com> wrote:
> If I got it right: you are using a term query, using a function to get TF
> as the score, iterating over all documents in the results, and summing up
> the total number of occurrences of a specific term in the index? Is this
> the only way you use the index, or is this side functionality?
>
> Thanks,
> Emir
>
> --
> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
> Solr & Elasticsearch Support * http://sematext.com/

On 24.10.2015 22:28, Aki Balogh wrote:
> Certainly, yes. I'm just doing a word count, i.e. how often does a
> specific term come up in the corpus?

On Oct 24, 2015 4:20 PM, "Upayavira" <u...@odoko.co.uk> wrote:
> Yes, but what do you want to do with the TF? What problem are you solving
> with it? If you are able to share that...

On Sat, Oct 24, 2015, at 09:05 PM, Aki Balogh wrote:
> Yes, sorry, I am not being clear.
>
> We are not even doing scoring, just getting the raw TF values. We're
> doing this in Solr because it can scale well.
>
> But with large corpora, retrieving the word counts takes some time, in
> part because Solr splits the word count up by document and generates a
> large response. We then take the response and just sum it all up. I'm
> wondering if there's a more direct way.

On Oct 24, 2015 4:00 PM, "Upayavira" <u...@odoko.co.uk> wrote:
> Can you explain more what you are using TF for? Because it sounds rather
> like scoring. You could disable field norms and IDF, and scoring would be
> mostly TF, no?
>
> Upayavira

On Sat, Oct 24, 2015, at 07:28 PM, Aki Balogh wrote:
> Thanks, let me think about that.
>
> We're using termfreq to get the TF score, but we don't know which term
> we'll need the TF for. So we'd have to do a corpus-wide summing of
> termfreq for each potential term across all documents in the corpus. It
> seems like it'd require some development work to compute that, and our
> code would be fragile.
>
> Let me think about that more.
>
> It might make sense to just move to SolrCloud; it's the right
> architectural decision anyway.

On Sat, Oct 24, 2015 at 1:54 PM, Upayavira <u...@odoko.co.uk> wrote:
> If you just want word length, then do the work during indexing - index a
> field for the word length. Then, I believe, you can do faceting - e.g.
> with the JSON faceting API I believe you can do a sum() calculation on a
> field rather than the more traditional count.
>
> Thinking aloud, there might be an easier way - index a field that is the
> same for all documents, and facet on it. Instead of counting the number
> of documents, calculate the sum() of your word count field.
>
> I *think* that should work.
>
> Upayavira

On Sat, Oct 24, 2015, at 04:24 PM, Aki Balogh wrote:
> Hi Jack,
>
> I'm just using Solr to get word counts across a large number of
> documents. It's somewhat non-standard, because we're ignoring relevance,
> but it seems to work well for this use case otherwise.
>
> My understanding then is:
> 1) since termfreq is pre-computed and fetched, there's no good way to
> speed it up (except by caching earlier calculations)
> 2) there's no way to have Solr sum up all of the termfreqs across all
> documents in a search and just return one number for total termfreqs
>
> Are these correct?
>
> Thanks,
> Aki
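Upayavira's JSON Facet API suggestion quoted above would look something
like the sketch below: it assumes a numeric word_count field was populated
at index time (the thread does not confirm such a field exists) and asks
Solr to return a single sum() aggregate instead of one value per document.

# Sum an indexed per-document word_count field with the JSON Facet API,
# so Solr returns one aggregate number rather than per-document values.
# Field name, collection and URL are assumptions.
import requests

SOLR = "http://localhost:8983/solr/articles/query"

body = {
    "query": 'body:"faceting"',   # the term whose matching docs we aggregate
    "limit": 0,                   # no documents needed, only the aggregate
    "facet": {
        "total_words": "sum(word_count)"
    },
}
facets = requests.post(SOLR, json=body).json()["facets"]
print(facets["total_words"])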
On Sat, Oct 24, 2015 at 11:20 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
> That's what a normal query does - Lucene takes all the terms used in the
> query and sums them up for each document in the response, producing a
> single number, the score, for each document. That's the way Solr is
> designed to be used. You still haven't elaborated why you are trying to
> use Solr in a way other than it was intended.
>
> -- Jack Krupansky

On Sat, Oct 24, 2015 at 11:13 AM, Aki Balogh <a...@marketmuse.com> wrote:
> Gotcha - that's disheartening.
>
> One idea: when I run termfreq, I get all of the termfreqs for each
> document one by one.
>
> Is there a way to have Solr sum them up before building the response, so
> I only receive one number back?

On Sat, Oct 24, 2015 at 11:05 AM, Upayavira <u...@odoko.co.uk> wrote:
> If you mean using the term frequency function query, then I'm not sure
> there's a huge amount you can do to improve performance.
>
> The term frequency is a number that is used often, so it is stored in the
> index pre-calculated. Perhaps, if your data is not changing, optimising
> your index would reduce it to one segment, and thus might ever so
> slightly speed up the aggregation of term frequencies, but I doubt it'd
> make enough difference to be worth doing.
>
> Upayavira

On Sat, Oct 24, 2015, at 03:37 PM, Aki Balogh wrote:
> Thanks, Jack. I did some more research and found similar results.
>
> In our application, we are making multiple (think: 50) concurrent
> requests to calculate term frequency on a set of documents in
> "real time". The faster the results return, the better.
>
> Most of these requests are unique, so the cache only helps slightly. This
> analysis is happening on a single Solr instance.
>
> Other than moving to SolrCloud and splitting the processing out onto
> multiple servers, do you have any suggestions for what might speed up
> termfreq at query time?
>
> Thanks,
> Aki
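Upayavira's earlier suggestion to optimise the index down to one segment
(only sensible if the index is not being updated) amounts to a forced merge
against the update handler; a minimal sketch, with the URL and collection
name assumed:

# Force-merge the index to a single segment via the update handler.
# Only worth trying on an index that is not being updated; collection
# name and URL are assumptions.
import requests

SOLR = "http://localhost:8983/solr/articles/update"

requests.get(SOLR, params={"optimize": "true", "wt": "json"})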
On Fri, Oct 23, 2015 at 7:21 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
> Term frequency applies only to the indexed terms of a tokenized field.
> DocValues is really just a copy of the original source text and is not
> tokenized into terms.
>
> Maybe you could explain how exactly you are using term frequency in
> function queries. More importantly, what is so "heavy" about your usage?
> Generally, moderate use of a feature is much more advisable than heavy
> usage, unless you don't care about performance.
>
> -- Jack Krupansky

On Fri, Oct 23, 2015 at 8:19 AM, Aki Balogh <a...@marketmuse.com> wrote:
> Hello,
>
> In our Solr application, we use a Function Query (termfreq) very heavily.
> Index time and disk space are not important, but we're looking to improve
> performance on termfreq at query time.
>
> I've been reading up on docValues. Would this be a way to improve
> performance?
>
> I had read that Lucene uses the Field Cache for Function Queries, so
> performance may not be affected.
>
> And, any general suggestions for improving query performance on Function
> Queries?
>
> Thanks,
> Aki
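For concreteness, the kind of request described in Aki's original message -
termfreq() evaluated as a function query for each term of interest, returned
for every document - looks roughly like this; the collection, field, and
terms are placeholders:

# termfreq() as pseudo-fields, one per term of interest, returned for
# every matching document. Collection ("articles"), field ("body") and
# the terms are placeholders, not details from the thread.
import requests

SOLR = "http://localhost:8983/solr/articles/select"

terms = ["solr", "lucene", "faceting"]
params = {
    "q": "*:*",
    "rows": 1000,
    "wt": "json",
    "fl": ",".join(["id"] + ["tf_%s:termfreq(body,'%s')" % (t, t) for t in terms]),
}
docs = requests.get(SOLR, params=params).json()["response"]["docs"]

# If only an index-wide total for a known term is needed, the
# totaltermfreq()/ttf() function query may also be worth a look; noted
# here as a pointer, it was not discussed in the thread.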