Be aware that docValues can only be used for non-text types, i.e. numerics, strings and the like. Specifically, docValues are _not_ possible for solr.TextField, and they don't support analysis chains because the underlying primitive types don't. You'll get an error if you try to specify docValues on a solr.TextField type.
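To make the restriction concrete, here is a rough sketch of a check that flags docValues on solr.TextField in a schema before Solr rejects it. The schema fragment, the field names and the helper find_docvalues_on_text are all made up for illustration, and the check only looks at explicit docValues="true" attributes, not at values a field inherits from its type.

```python
import xml.etree.ElementTree as ET

# Hypothetical schema fragment; type and field names are illustrative only.
SCHEMA = """
<schema name="example" version="1.5">
  <fieldType name="string" class="solr.StrField" docValues="true"/>
  <fieldType name="text_general" class="solr.TextField" docValues="true"/>
  <field name="id" type="string" indexed="true" stored="true"/>
  <field name="body" type="text_general" indexed="true" stored="true" docValues="true"/>
  <field name="title" type="text_general" indexed="true" stored="true"/>
</schema>
"""


def find_docvalues_on_text(schema_xml):
    """Return (kind, name) pairs where docValues is explicitly enabled on a TextField."""
    root = ET.fromstring(schema_xml)
    # Names of all field types backed by solr.TextField.
    text_types = {
        ft.get("name")
        for ft in root.iter("fieldType")
        if ft.get("class") == "solr.TextField"
    }
    problems = []
    # Field types that are TextField and set docValues themselves.
    for ft in root.iter("fieldType"):
        if ft.get("name") in text_types and ft.get("docValues") == "true":
            problems.append(("fieldType", ft.get("name")))
    # Fields that use a TextField type and set docValues themselves.
    for field in root.iter("field"):
        if field.get("type") in text_types and field.get("docValues") == "true":
            problems.append(("field", field.get("name")))
    return problems


if __name__ == "__main__":
    for kind, name in find_docvalues_on_text(SCHEMA):
        print(f"docValues not supported on TextField: {kind} '{name}'")
```

The StrField type with docValues passes the check, which matches the point above that strings are fine and only analyzed text is not.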
Does that change the discussion?

Best,
Erick

On Mon, Oct 26, 2015 at 7:36 AM, Emir Arnautovic <emir.arnauto...@sematext.com> wrote:
> Hi Aki,
> IMO this is underuse of Solr (not to mention SolrCloud). I would recommend
> doing in memory document parsing (if you need something from Lucene/Solr
> analysis classes, use it) and use some other cache like solution to store
> term/total frequency pairs (you can try Redis).
>
> That way you will have updatable, fast total frequency lookups.
>
> Thanks,
> Emir
>
> On 26.10.2015 14:43, Aki Balogh wrote:
>> Hi Emir,
>>
>> This is correct. This is the only way we use the index.
>>
>> Thanks,
>> Aki
>>
>> On Mon, Oct 26, 2015 at 9:31 AM, Emir Arnautovic <emir.arnauto...@sematext.com> wrote:
>>> If I got it right, you are using term query, use function to get TF as
>>> score, iterate all documents in results and sum up total number of
>>> occurrences of specific term in index? Is this only way you use index or
>>> this is side functionality?
>>>
>>> Thanks,
>>> Emir
>>>
>>> On 24.10.2015 22:28, Aki Balogh wrote:
>>>> Certainly, yes. I'm just doing a word count, ie how often does a
>>>> specific term come up in the corpus?
>>>>
>>>> On Oct 24, 2015 4:20 PM, "Upayavira" <u...@odoko.co.uk> wrote:
>>>>> yes, but what do you want to do with the TF? What problem are you
>>>>> solving with it? If you are able to share that...
>>>>>
>>>>> On Sat, Oct 24, 2015, at 09:05 PM, Aki Balogh wrote:
>>>>>> Yes, sorry, I am not being clear.
>>>>>>
>>>>>> We are not even doing scoring, just getting the raw TF values. We're
>>>>>> doing this in solr because it can scale well.
>>>>>>
>>>>>> But with large corpora, retrieving the word counts takes some time, in
>>>>>> part because solr is splitting up word count by document and generating
>>>>>> a large request. We then get the request and just sum it all up. I'm
>>>>>> wondering if there's a more direct way.
>>>>>>
>>>>>> On Oct 24, 2015 4:00 PM, "Upayavira" <u...@odoko.co.uk> wrote:
>>>>>>> Can you explain more what you are using TF for? Because it sounds
>>>>>>> rather like scoring. You could disable field norms and IDF and scoring
>>>>>>> would be mostly TF, no?
>>>>>>>
>>>>>>> Upayavira
>>>>>>>
>>>>>>> On Sat, Oct 24, 2015, at 07:28 PM, Aki Balogh wrote:
>>>>>>>> Thanks, let me think about that.
>>>>>>>>
>>>>>>>> We're using termfreq to get the TF score, but we don't know which term
>>>>>>>> we'll need the TF for. So we'd have to do a corpuswide summing of
>>>>>>>> termfreq for each potential term across all documents in the corpus.
>>>>>>>> It seems like it'd require some development work to compute that, and
>>>>>>>> our code would be fragile.
>>>>>>>>
>>>>>>>> Let me think about that more.
>>>>>>>>
>>>>>>>> It might make sense to just move to solrcloud, it's the right
>>>>>>>> architectural decision anyway.
>>>>>>>>
>>>>>>>> On Sat, Oct 24, 2015 at 1:54 PM, Upayavira <u...@odoko.co.uk> wrote:
>>>>>>>>> If you just want word length, then do work during indexing - index a
>>>>>>>>> field for the word length. Then, I believe you can do faceting - e.g.
>>>>>>>>> with the json faceting API I believe you can do a sum() calculation
>>>>>>>>> on a field rather than the more traditional count.
>>>>>>>>>
>>>>>>>>> Thinking aloud, there might be an easier way - index a field that is
>>>>>>>>> the same for all documents, and facet on it. Instead of counting the
>>>>>>>>> number of documents, calculate the sum() of your word count field.
>>>>>>>>>
>>>>>>>>> I *think* that should work.
>>>>>>>>>
>>>>>>>>> Upayavira
>>>>>>>>>
>>>>>>>>> On Sat, Oct 24, 2015, at 04:24 PM, Aki Balogh wrote:
>>>>>>>>>> Hi Jack,
>>>>>>>>>>
>>>>>>>>>> I'm just using solr to get word count across a large number of
>>>>>>>>>> documents.
>>>>>>>>>>
>>>>>>>>>> It's somewhat non-standard, because we're ignoring relevance, but it
>>>>>>>>>> seems to work well for this use case otherwise.
>>>>>>>>>>
>>>>>>>>>> My understanding then is:
>>>>>>>>>> 1) since termfreq is pre-processed and fetched, there's no good way
>>>>>>>>>> to speed it up (except by caching earlier calculations)
>>>>>>>>>>
>>>>>>>>>> 2) there's no way to have solr sum up all of the termfreqs across
>>>>>>>>>> all documents in a search and just return one number for total
>>>>>>>>>> termfreqs
>>>>>>>>>>
>>>>>>>>>> Are these correct?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Aki
>>>>>>>>>>
>>>>>>>>>> On Sat, Oct 24, 2015 at 11:20 AM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>>>>>>>>>>> That's what a normal query does - Lucene takes all the terms used
>>>>>>>>>>> in the query and sums them up for each document in the response,
>>>>>>>>>>> producing a single number, the score, for each document. That's the
>>>>>>>>>>> way Solr is designed to be used. You still haven't elaborated why
>>>>>>>>>>> you are trying to use Solr in a way other than it was intended.
>>>>>>>>>>>
>>>>>>>>>>> -- Jack Krupansky
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Oct 24, 2015 at 11:13 AM, Aki Balogh <a...@marketmuse.com> wrote:
>>>>>>>>>>>> Gotcha - that's disheartening.
>>>>>>>>>>>>
>>>>>>>>>>>> One idea: when I run termfreq, I get all of the termfreqs for each
>>>>>>>>>>>> document one-by-one.
>>>>>>>>>>>>
>>>>>>>>>>>> Is there a way to have solr sum it up before creating the request,
>>>>>>>>>>>> so I only receive one number in the response?
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Oct 24, 2015 at 11:05 AM, Upayavira <u...@odoko.co.uk> wrote:
>>>>>>>>>>>>> If you mean using the term frequency function query, then I'm not
>>>>>>>>>>>>> sure there's a huge amount you can do to improve performance.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The term frequency is a number that is used often, so it is
>>>>>>>>>>>>> stored in the index pre-calculated. Perhaps, if your data is not
>>>>>>>>>>>>> changing, optimising your index would reduce it to one segment,
>>>>>>>>>>>>> and thus might ever so slightly speed the aggregation of term
>>>>>>>>>>>>> frequencies, but I doubt it'd make enough difference to make it
>>>>>>>>>>>>> worth doing.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Upayavira
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Sat, Oct 24, 2015, at 03:37 PM, Aki Balogh wrote:
>>>>>>>>>>>>>> Thanks, Jack. I did some more research and found similar
>>>>>>>>>>>>>> results.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In our application, we are making multiple (think: 50)
>>>>>>>>>>>>>> concurrent requests to calculate term frequency on a set of
>>>>>>>>>>>>>> documents in "real-time". The faster that results return, the
>>>>>>>>>>>>>> better.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Most of these requests are unique, so cache only helps slightly.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This analysis is happening on a single solr instance.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Other than moving to solr cloud and splitting out the processing
>>>>>>>>>>>>>> onto multiple servers, do you have any suggestions for what
>>>>>>>>>>>>>> might speed up termfreq at query time?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>> Aki
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Oct 23, 2015 at 7:21 PM, Jack Krupansky <jack.krupan...@gmail.com> wrote:
>>>>>>>>>>>>>>> Term frequency applies only to the indexed terms of a tokenized
>>>>>>>>>>>>>>> field. DocValues is really just a copy of the original source
>>>>>>>>>>>>>>> text and is not tokenized into terms.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Maybe you could explain how exactly you are using term
>>>>>>>>>>>>>>> frequency in function queries. More importantly, what is so
>>>>>>>>>>>>>>> "heavy" about your usage? Generally, moderate use of a feature
>>>>>>>>>>>>>>> is much more advisable to heavy usage, unless you don't care
>>>>>>>>>>>>>>> about performance.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -- Jack Krupansky
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Oct 23, 2015 at 8:19 AM, Aki Balogh <a...@marketmuse.com> wrote:
>>>>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> In our solr application, we use a Function Query (termfreq)
>>>>>>>>>>>>>>>> very heavily.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Index time and disk space are not important, but we're looking
>>>>>>>>>>>>>>>> to improve performance on termfreq at query time.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I've been reading up on docValues. Would this be a way to
>>>>>>>>>>>>>>>> improve performance?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I had read that Lucene uses Field Cache for Function Queries,
>>>>>>>>>>>>>>>> so performance may not be affected.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> And, any general suggestions for improving query performance
>>>>>>>>>>>>>>>> on Function Queries?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>> Aki
>>>
>>> --
>>> Monitoring * Alerting * Anomaly Detection * Centralized Log Management
>>> Solr & Elasticsearch Support * http://sematext.com/