Re: Does docValues impact termfreq ?

Emir Arnautovic Mon, 26 Oct 2015 07:37:36 -0700

Hi Aki,

IMO this is underuse of Solr (not to mention SolrCloud). I would
recommend doing in-memory document parsing (if you need something from
the Lucene/Solr analysis classes, use it) and using some other
cache-like solution to store term/total frequency pairs (you can try
Redis).


That way you will have updatable, fast total frequency lookups.
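
A minimal sketch of that idea, under stated assumptions: a crude regex tokenizer stands in for a Lucene/Solr analyzer, and an in-process `Counter` stands in for an external cache such as Redis. The names `tokenize`, `TermFrequencyStore`, `add_document`, `remove_document` and `total_frequency` are hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Assumption: a simple word tokenizer; a real setup would mirror the
# analysis chain used for the indexed field.
_TOKEN = re.compile(r"\w+")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return [token.lower() for token in _TOKEN.findall(text)]


class TermFrequencyStore:
    """Keeps corpus-wide term totals; a Counter stands in for the cache."""

    def __init__(self):
        self._totals = Counter()

    def add_document(self, text):
        # Add every token occurrence of the document to the running totals.
        self._totals.update(tokenize(text))

    def remove_document(self, text):
        # Subtract a document's tokens when it is deleted or replaced.
        self._totals.subtract(tokenize(text))

    def total_frequency(self, term):
        # Counter returns 0 for terms that were never seen.
        return self._totals[term.lower()]


store = TermFrequencyStore()
store.add_document("Solr scales well; Solr is fast")
store.add_document("Redis is a cache")
print(store.total_frequency("solr"))  # 2
```

Each lookup is then a single key read, independent of how many documents contain the term.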

Thanks,
Emir

On 26.10.2015 14:43, Aki Balogh wrote:

Hi Emir,

This is correct. This is the only way we use the index.

Thanks,
Aki

On Mon, Oct 26, 2015 at 9:31 AM, Emir Arnautovic <[email protected]> wrote:

If I got it right, you are using a term query, using a function to get
TF as the score, iterating over all documents in the results and summing
up the total number of occurrences of a specific term in the index? Is
this the only way you use the index, or is this side functionality?
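
The workflow in question can be sketched roughly as follows. This is an illustration only: the field name `text`, the alias `tf` and the sample response are assumptions, and the sample dict merely mimics the usual shape of a Solr JSON response rather than coming from a real server.

```python
def build_params(field, term, rows):
    """Parameters for a term query that returns termfreq() as a pseudo-field."""
    return {
        "q": f"{field}:{term}",
        "fl": f"id,tf:termfreq({field},'{term}')",
        "rows": rows,
        "wt": "json",
    }


def sum_term_frequency(response):
    """Sum the per-document 'tf' values on the client side."""
    docs = response["response"]["docs"]
    return sum(int(doc.get("tf", 0)) for doc in docs)


# Hypothetical response: one entry per matching document.
sample = {
    "response": {
        "numFound": 3,
        "start": 0,
        "docs": [
            {"id": "1", "tf": 3},
            {"id": "2", "tf": 1},
            {"id": "3", "tf": 5},
        ],
    }
}

print(build_params("text", "solr", 1000)["fl"])
print(sum_term_frequency(sample))  # 9
```

The response grows with the number of matching documents, which is the cost being discussed in this thread.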

Thanks,
Emir


On 24.10.2015 22:28, Aki Balogh wrote:

Certainly, yes. I'm just doing a word count, i.e. how often does a
specific term come up in the corpus?

On Oct 24, 2015 4:20 PM, "Upayavira" <[email protected]> wrote:

yes, but what do you want to do with the TF? What problem are you
solving with it? If you are able to share that...

On Sat, Oct 24, 2015, at 09:05 PM, Aki Balogh wrote:

Yes, sorry, I am not being clear.

We are not even doing scoring, just getting the raw TF values. We're
doing this in solr because it can scale well.

But with large corpora, retrieving the word counts takes some time, in
part because solr is splitting up word count by document and generating
a large request. We then get the request and just sum it all up. I'm
wondering if there's a more direct way.

On Oct 24, 2015 4:00 PM, "Upayavira" <[email protected]> wrote:

Can you explain more what you are using TF for? Because it sounds
rather like scoring. You could disable field norms and IDF and scoring
would be mostly TF, no?

Upayavira

On Sat, Oct 24, 2015, at 07:28 PM, Aki Balogh wrote:

Thanks, let me think about that.

We're using termfreq to get the TF score, but we don't know which term
we'll need the TF for. So we'd have to do a corpuswide summing of
termfreq for each potential term across all documents in the corpus. It
seems like it'd require some development work to compute that, and our
code would be fragile.

Let me think about that more.

It might make sense to just move to solrcloud, it's the right
architectural decision anyway.


On Sat, Oct 24, 2015 at 1:54 PM, Upayavira <[email protected]> wrote:

If you just want word length, then do work during indexing - index a
field for the word length. Then, I believe you can do faceting - e.g.
with the json faceting API I believe you can do a sum() calculation on
a field rather than the more traditional count.

Thinking aloud, there might be an easier way - index a field that is
the same for all documents, and facet on it. Instead of counting the
number of documents, calculate the sum() of your word count field.

I *think* that should work.
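
A rough sketch of what such a request body could look like with the JSON Facet API, assuming a hypothetical numeric field `word_count` populated at index time. Only the body is built here; nothing is sent to a server, and the field name and the `total` key are placeholders.

```python
import json


def build_sum_request(query, count_field):
    """Build a JSON request body asking for one aggregate, not documents."""
    return {
        "query": query,
        "limit": 0,  # no documents needed, only the aggregate
        "facet": {"total": f"sum({count_field})"},
    }


# Assumption: 'word_count' holds the per-document count computed at index time.
body = build_sum_request("*:*", "word_count")
print(json.dumps(body, indent=2))
```

With a request of this shape the summing happens on the server, so the response would carry one number instead of one value per document.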

Upayavira

On Sat, Oct 24, 2015, at 04:24 PM, Aki Balogh wrote:

Hi Jack,

I'm just using solr to get word count across a large number of
documents.

It's somewhat non-standard, because we're ignoring relevance, but it
seems to work well for this use case otherwise.

My understanding then is:
1) since termfreq is pre-processed and fetched, there's no good way to
speed it up (except by caching earlier calculations)

2) there's no way to have solr sum up all of the termfreqs across all
documents in a search and just return one number for total termfreqs

Are these correct?

Thanks,
Aki


On Sat, Oct 24, 2015 at 11:20 AM, Jack Krupansky <[email protected]> wrote:

That's what a normal query does - Lucene takes all the terms used in
the query and sums them up for each document in the response, producing
a single number, the score, for each document. That's the way Solr is
designed to be used. You still haven't elaborated why you are trying to
use Solr in a way other than it was intended.

-- Jack Krupansky

On Sat, Oct 24, 2015 at 11:13 AM, Aki Balogh <[email protected]> wrote:

Gotcha - that's disheartening.

One idea: when I run termfreq, I get all of the termfreqs for each
document one-by-one.

Is there a way to have solr sum it up before creating the request, so I
only receive one number in the response?


On Sat, Oct 24, 2015 at 11:05 AM, Upayavira <[email protected]> wrote:

If you mean using the term frequency function query, then I'm not sure
there's a huge amount you can do to improve performance.

The term frequency is a number that is used often, so it is stored in
the index pre-calculated. Perhaps, if your data is not changing,
optimising your index would reduce it to one segment, and thus might
ever so slightly speed the aggregation of term frequencies, but I doubt
it'd make enough difference to make it worth doing.

Upayavira

On Sat, Oct 24, 2015, at 03:37 PM, Aki Balogh wrote:

Thanks, Jack. I did some more research and found similar results.

In our application, we are making multiple (think: 50) concurrent
requests to calculate term frequency on a set of documents in
"real-time". The faster that results return, the better.

Most of these requests are unique, so cache only helps slightly.

This analysis is happening on a single solr instance.

Other than moving to solr cloud and splitting out the processing onto
multiple servers, do you have any suggestions for what might speed up
termfreq at query time?

Thanks,
Aki


On Fri, Oct 23, 2015 at 7:21 PM, Jack Krupansky <[email protected]> wrote:

Term frequency applies only to the indexed terms of a tokenized field.
DocValues is really just a copy of the original source text and is not
tokenized into terms.

Maybe you could explain how exactly you are using term frequency in
function queries. More importantly, what is so "heavy" about your
usage?

Generally, moderate use of a feature is much more advisable than heavy
usage, unless you don't care about performance.

-- Jack Krupansky

On Fri, Oct 23, 2015 at 8:19 AM, Aki Balogh <[email protected]> wrote:

Hello,

In our solr application, we use a Function Query (termfreq) very
heavily.

Index time and disk space are not important, but we're looking to
improve performance on termfreq at query time.

I've been reading up on docValues. Would this be a way to improve
performance?

I had read that Lucene uses Field Cache for Function Queries, so
performance may not be affected.

And, any general suggestions for improving query performance on
Function Queries?

Thanks,
Aki

--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/

