Aki, does the sumtotaltermfreq function do what you need?
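
For reference, a rough sketch of how the index-wide counts could be requested in a single row instead of summing termfreq per document on the client. As far as I know, sumtotaltermfreq(field) (alias sttf) returns the total number of indexed tokens in the field, while totaltermfreq(field,term) (alias ttf) returns how often one specific term occurs across the whole index, so the second is probably the closer match for a per-term word count. The host, the core name "mycore", the field name "text", and the helper names are assumptions for illustration only, and the term is assumed to be a single plain token with no quotes in it.

```python
from urllib.parse import urlencode


def build_term_count_url(base_url, field, term):
    """Build a Solr /select URL that returns index-wide counts as pseudo-fields.

    base_url, field and term are placeholders; nothing here is specific to
    the setup discussed in this thread.
    """
    params = {
        "q": "*:*",
        # ttf/sttf are index-level statistics, so the value is identical on
        # every document and a single row is enough.
        "rows": 1,
        "fl": f"term_total:ttf({field},'{term}'),field_total:sttf({field})",
        "wt": "json",
    }
    return f"{base_url}/select?{urlencode(params)}"


def read_counts(response):
    """Extract (term_total, field_total) from a parsed JSON response."""
    docs = response["response"]["docs"]
    if not docs:
        return None  # empty index or no matching document
    return docs[0]["term_total"], docs[0]["field_total"]


# Hypothetical host and core name.
url = build_term_count_url("http://localhost:8983/solr/mycore", "text", "solr")
print(url)
```

Because these values come from index statistics rather than from the result set, they are not restricted to the documents matching q. If the count has to be limited to a subset of documents, this sketch does not cover that case.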

On Mon, Oct 26, 2015 at 9:43 AM, Aki Balogh <a...@marketmuse.com> wrote:

> Hi Emir,
>
> This is correct. This is the only way we use the index.
>
> Thanks,
> Aki
>
> On Mon, Oct 26, 2015 at 9:31 AM, Emir Arnautovic <emir.arnauto...@sematext.com> wrote:
>
>> If I got it right, you are using term query, use function to get TF as
>> score, iterate all documents in results and sum up total number of
>> occurrences of specific term in index? Is this only way you use index or
>> this is side functionality?
>>
>> Thanks,
>> Emir
>>
>> On 24.10.2015 22:28, Aki Balogh wrote:
>>
>>> Certainly, yes. I'm just doing a word count, ie how often does a specific
>>> term come up in the corpus?
>>>
>>> On Oct 24, 2015 4:20 PM, "Upayavira" <u...@odoko.co.uk> wrote:
>>>
>>>> yes, but what do you want to do with the TF? What problem are you
>>>> solving with it? If you are able to share that...
>>>>
>>>> On Sat, Oct 24, 2015, at 09:05 PM, Aki Balogh wrote:
>>>>
>>>>> Yes, sorry, I am not being clear.
>>>>>
>>>>> We are not even doing scoring, just getting the raw TF values. We're
>>>>> doing this in solr because it can scale well.
>>>>>
>>>>> But with large corpora, retrieving the word counts takes some time, in
>>>>> part because solr is splitting up word count by document and generating
>>>>> a large request. We then get the request and just sum it all up. I'm
>>>>> wondering if there's a more direct way.
>>>>>
>>>>> On Oct 24, 2015 4:00 PM, "Upayavira" <u...@odoko.co.uk> wrote:
>>>>>
>>>>>> Can you explain more what you are using TF for? Because it sounds
>>>>>> rather like scoring. You could disable field norms and IDF and scoring
>>>>>> would be mostly TF, no?
>>>>>>
>>>>>> Upayavira
>>>>>>
>>>>>> On Sat, Oct 24, 2015, at 07:28 PM, Aki Balogh wrote:
>>>>>>
>>>>>>> Thanks, let me think about that.
>>>>>>>
>>>>>>> We're using termfreq to get the TF score, but we don't know which
>>>>>>> term we'll need the TF for. So we'd have to do a corpuswide summing of
>>>>>>> termfreq for each potential term across all documents in the corpus.
>>>>>>> It seems like it'd require some development work to compute that, and
>>>>>>> our code would be fragile.
>>>>>>>
>>>>>>> Let me think about that more.
>>>>>>>
>>>>>>> It might make sense to just move to solrcloud, it's the right
>>>>>>> architectural decision anyway.
>>>>>>>
>>>>>>> On Sat, Oct 24, 2015 at 1:54 PM, Upayavira <u...@odoko.co.uk> wrote:
>>>>>>>
>>>>>>>> If you just want word length, then do work during indexing - index a
>>>>>>>> field for the word length. Then, I believe you can do faceting - e.g.
>>>>>>>> with the json faceting API I believe you can do a sum() calculation
>>>>>>>> on a field rather than the more traditional count.
>>>>>>>>
>>>>>>>> Thinking aloud, there might be an easier way - index a field that is
>>>>>>>> the same for all documents, and facet on it. Instead of counting the
>>>>>>>> number of documents, calculate the sum() of your word count field.
>>>>>>>>
>>>>>>>> I *think* that should work.
>>>>>>>>
>>>>>>>> Upayavira
>>>>>>>>
>>>>>>>> On Sat, Oct 24, 2015, at 04:24 PM, Aki Balogh wrote:
>>>>>>>>
>>>>>>>>> Hi Jack,
>>>>>>>>>
>>>>>>>>> I'm just using solr to get word count across a large number of
>>>>>>>>> documents. It's somewhat non-standard, because we're ignoring
>>>>>>>>> relevance, but it seems to work well for this use case otherwise.
>>>>>>>>>
>>>>>>>>> My understanding then is:
>>>>>>>>> 1) since termfreq is pre-processed and fetched, there's no good way
>>>>>>>>> to speed it up (except by caching earlier calculations)
>>>>>>>>>
>>>>>>>>> 2) there's no way to have solr sum up all of the termfreqs across
>>>>>>>>> all documents in a search and just return one number for total
>>>>>>>>> termfreqs
>>>>>>>>>
>>>>>>>>> Are these correct?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Aki
>>>>>>>>>
>>>>>>>>> On Sat, Oct 24, 2015 at 11:20 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> That's what a normal query does - Lucene takes all the terms used
>>>>>>>>>> in the query and sums them up for each document in the response,
>>>>>>>>>> producing a single number, the score, for each document. That's the
>>>>>>>>>> way Solr is designed to be used. You still haven't elaborated why
>>>>>>>>>> you are trying to use Solr in a way other than it was intended.
>>>>>>>>>>
>>>>>>>>>> -- Jack Krupansky
>>>>>>>>>>
>>>>>>>>>> On Sat, Oct 24, 2015 at 11:13 AM, Aki Balogh <a...@marketmuse.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Gotcha - that's disheartening.
>>>>>>>>>>>
>>>>>>>>>>> One idea: when I run termfreq, I get all of the termfreqs for each
>>>>>>>>>>> document one-by-one.
>>>>>>>>>>>
>>>>>>>>>>> Is there a way to have solr sum it up before creating the request,
>>>>>>>>>>> so I only receive one number in the response?
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Oct 24, 2015 at 11:05 AM, Upayavira <u...@odoko.co.uk> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> If you mean using the term frequency function query, then I'm not
>>>>>>>>>>>> sure there's a huge amount you can do to improve performance.
>>>>>>>>>>>>
>>>>>>>>>>>> The term frequency is a number that is used often, so it is
>>>>>>>>>>>> stored in the index pre-calculated. Perhaps, if your data is not
>>>>>>>>>>>> changing, optimising your index would reduce it to one segment,
>>>>>>>>>>>> and thus might ever so slightly speed the aggregation of term
>>>>>>>>>>>> frequencies, but I doubt it'd make enough difference to make it
>>>>>>>>>>>> worth doing.
>>>>>>>>>>>>
>>>>>>>>>>>> Upayavira
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Oct 24, 2015, at 03:37 PM, Aki Balogh wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks, Jack. I did some more research and found similar
>>>>>>>>>>>>> results.
>>>>>>>>>>>>>
>>>>>>>>>>>>> In our application, we are making multiple (think: 50)
>>>>>>>>>>>>> concurrent requests to calculate term frequency on a set of
>>>>>>>>>>>>> documents in "real-time". The faster that results return, the
>>>>>>>>>>>>> better.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Most of these requests are unique, so cache only helps slightly.
>>>>>>>>>>>>>
>>>>>>>>>>>>> This analysis is happening on a single solr instance.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Other than moving to solr cloud and splitting out the processing
>>>>>>>>>>>>> onto multiple servers, do you have any suggestions for what
>>>>>>>>>>>>> might speed up termfreq at query time?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Aki
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Oct 23, 2015 at 7:21 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Term frequency applies only to the indexed terms of a tokenized
>>>>>>>>>>>>>> field. DocValues is really just a copy of the original source
>>>>>>>>>>>>>> text and is not tokenized into terms.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Maybe you could explain how exactly you are using term
>>>>>>>>>>>>>> frequency in function queries. More importantly, what is so
>>>>>>>>>>>>>> "heavy" about your usage? Generally, moderate use of a feature
>>>>>>>>>>>>>> is much more advisable to heavy usage, unless you don't care
>>>>>>>>>>>>>> about performance.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -- Jack Krupansky
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Oct 23, 2015 at 8:19 AM, Aki Balogh <a...@marketmuse.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> In our solr application, we use a Function Query (termfreq)
>>>>>>>>>>>>>>> very heavily.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Index time and disk space are not important, but we're looking
>>>>>>>>>>>>>>> to improve performance on termfreq at query time.
>>>>>>>>>>>>>>> I've been reading up on docValues. Would this be a way to
>>>>>>>>>>>>>>> improve performance?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I had read that Lucene uses Field Cache for Function Queries,
>>>>>>>>>>>>>>> so performance may not be affected.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> And, any general suggestions for improving query performance
>>>>>>>>>>>>>>> on Function Queries?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>> Aki
>>
>> --
>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>> Solr & Elasticsearch Support * http://sematext.com/

--
Scott Stults | Founder & Solutions Architect | OpenSource Connections, LLC
| 434.409.2780
http://www.opensourceconnections.com