Re: Lucene/Solr Git Mirrors 5 day lag behind SVN?

2015-10-24 Thread Michael McCandless
I added a comment on the INFRA issue.

I don't understand why it periodically "gets stuck".

Mike McCandless

http://blog.mikemccandless.com


On Fri, Oct 23, 2015 at 11:27 AM, Kevin Risden
 wrote:
> It looks like both Apache Git mirror (git://git.apache.org/lucene-solr.git)
> and GitHub mirror (https://github.com/apache/lucene-solr.git) are 5 days
> behind SVN. This seems to have happened before:
> https://issues.apache.org/jira/browse/INFRA-9182
>
> Is this a known issue?
>
> Kevin Risden


Order of actions in Update request

2015-10-24 Thread Jamie Johnson
Looking at the code and JIRA, I see that ordering actions in a SolrJ update
request is currently not supported, but I'd like to know if there is any
other way to get this capability.  I took a quick look at the XML loader,
and it appears to process actions in the order it sees them. So if the
order were changed to submit the actions as

Add
Delete
Add

Vs
Add
Add
Delete

Would this cause any issues with the update?  Would it achieve the desired
result?  Are there any other options for preserving the order in which
actions were provided to the update request?

Jamie
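To make the difference concrete, here is a toy Python model of the two processing orders. This is not Solr code - the "index" is just a dict and the operations are simplified assumptions - but it shows why preserving submission order changes the outcome:

```python
# Toy model (not Solr's implementation): compare applying update actions
# strictly in the order received vs. grouping all adds before all deletes.

def apply_in_order(actions):
    """Apply (op, doc_id, value) actions one by one, as submitted."""
    index = {}
    for op, doc_id, value in actions:
        if op == "add":
            index[doc_id] = value
        elif op == "delete":
            index.pop(doc_id, None)
    return index

def apply_grouped(actions):
    """Apply all adds first, then all deletes (submission order lost)."""
    adds = [a for a in actions if a[0] == "add"]
    deletes = [a for a in actions if a[0] == "delete"]
    return apply_in_order(adds + deletes)

# Add doc 1, delete doc 1, then re-add doc 1 with a new value.
actions = [
    ("add", 1, "v1"),
    ("delete", 1, None),
    ("add", 1, "v2"),
]

print(apply_in_order(actions))   # {1: 'v2'}
print(apply_grouped(actions))    # {}
```

In the in-order case the re-added document survives; when adds are grouped before deletes, the trailing delete removes the document entirely, so the two orderings give different final states.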


Re: Order of actions in Update request

2015-10-24 Thread Joseph Hammerman
Hi Jamie!



Re: Does docValues impact termfreq ?

2015-10-24 Thread Aki Balogh
Thanks, Jack. I did some more research and found similar results.

In our application, we make multiple (think: 50) concurrent requests
to calculate term frequency on a set of documents in "real time". The
faster results return, the better.

Most of these requests are unique, so caching only helps slightly.

This analysis happens on a single Solr instance.

Other than moving to SolrCloud and splitting the processing across multiple
servers, do you have any suggestions for speeding up termfreq at query time?

Thanks,
Aki


On Fri, Oct 23, 2015 at 7:21 PM, Jack Krupansky 
wrote:

> Term frequency applies only to the indexed terms of a tokenized field.
> DocValues is really just a copy of the original source text and is not
> tokenized into terms.
>
> Maybe you could explain how exactly you are using term frequency in
> function queries. More importantly, what is so "heavy" about your usage?
> Generally, moderate use of a feature is much more advisable than heavy usage,
> unless you don't care about performance.
>
> -- Jack Krupansky
>
> On Fri, Oct 23, 2015 at 8:19 AM, Aki Balogh  wrote:
>
> > Hello,
> >
> > In our solr application, we use a Function Query (termfreq) very heavily.
> >
> > Index time and disk space are not important, but we're looking to improve
> > performance on termfreq at query time.
> > I've been reading up on docValues. Would this be a way to improve
> > performance?
> >
> > I had read that Lucene uses Field Cache for Function Queries, so
> > performance may not be affected.
> >
> >
> > And, any general suggestions for improving query performance on Function
> > Queries?
> >
> > Thanks,
> > Aki
> >
>


Re: Does docValues impact termfreq ?

2015-10-24 Thread Upayavira
If you mean using the term frequency function query, then I'm not sure
there's a huge amount you can do to improve performance.

The term frequency is a number that is used often, so it is stored in
the index pre-calculated. Perhaps, if your data is not changing,
optimising your index would reduce it to one segment, and thus might
ever so slightly speed the aggregation of term frequencies, but I doubt
it'd make enough difference to make it worth doing.

Upayavira



Re: Does docValues impact termfreq ?

2015-10-24 Thread Aki Balogh
Gotcha - that's disheartening.

One idea: when I run termfreq, I get the termfreq for each document,
one by one.

Is there a way to have Solr sum these up before returning the response, so I
only receive one number?
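For reference, the workaround the thread converges on is to sum client-side, since Solr returns one termfreq value per document. A minimal sketch, using a hand-written stand-in for a Solr JSON response (the `tf` pseudo-field name and the values are made up):

```python
# Sum per-document termfreq values client-side. The dict below is a
# hypothetical stand-in for a Solr JSON response in which each doc carries
# a termfreq(field, 'term') pseudo-field aliased as "tf".
response = {
    "response": {
        "numFound": 3,
        "docs": [
            {"id": "a", "tf": 4},
            {"id": "b", "tf": 0},
            {"id": "c", "tf": 7},
        ],
    }
}

# One request still transfers every document's value; only the addition
# moves to the client.
total = sum(doc.get("tf", 0) for doc in response["response"]["docs"])
print(total)  # 11
```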




Re: Does docValues impact termfreq ?

2015-10-24 Thread Jack Krupansky
That's what a normal query does - Lucene takes all the terms used in the
query and sums them up for each document in the response, producing a
single number, the score, for each document. That's the way Solr is
designed to be used. You still haven't elaborated why you are trying to use
Solr in a way other than it was intended.

-- Jack Krupansky



Re: Does docValues impact termfreq ?

2015-10-24 Thread Aki Balogh
Hi Jack,

I'm just using Solr to get word counts across a large number of documents.

It's somewhat non-standard, because we're ignoring relevance, but it seems
to work well for this use case otherwise.

My understanding then is:
1) since termfreq is precomputed at index time and simply fetched, there's
no good way to speed it up (except by caching earlier calculations)

2) there's no way to have Solr sum the termfreqs across all documents in a
search and return a single total


Are these correct?

Thanks,
Aki




Re: Does docValues impact termfreq ?

2015-10-24 Thread Upayavira
If you just want word counts, then do the work during indexing: index a
field holding each document's word count. Then, I believe, you can use
faceting - e.g. with the JSON faceting API you can compute a sum() over a
field rather than the more traditional count.

Thinking aloud, there might be an easier way: index a field that is the
same for all documents, and facet on it. Instead of counting the number
of documents, calculate the sum() of your word-count field.

I *think* that should work.

Upayavira
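A sketch of what such a JSON Facet API request might look like, assuming a numeric word_count field was populated at index time. The query and field names here are invented, and actually sending the request requires a running Solr instance; this only builds and prints the request body:

```python
import json

# Hypothetical JSON Facet API request body: limit=0 suppresses the doc
# list, and the facet returns a single sum() aggregation over the assumed
# word_count field.
request = {
    "query": "body:solr",          # assumed query and field name
    "limit": 0,                    # only the aggregation is wanted
    "facet": {
        "total_words": "sum(word_count)"
    },
}

print(json.dumps(request, indent=2))
```

This would be POSTed to a collection's /query endpoint; the sum comes back as a single number, which is exactly the "one number in the response" asked for earlier in the thread.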



ant clean test failing

2015-10-24 Thread William Bell
It is getting stuck on resolve.

ant clean test

SOLR 5.3.1

[ivy:retrieve] retrieve done (5ms)

Overriding previous definition of property "ivy.version"

[ivy:retrieve] no resolved descriptor found: launching default resolve

Overriding previous definition of property "ivy.version"

[ivy:retrieve] using ivy parser to parse
file:/home/solr/src/lucene_solr_5_3_1A/solr/server/ivy.xml

[ivy:retrieve] :: resolving dependencies :: org.apache.solr#
example;work...@hgsolr2devmstr.healthgrades.com

[ivy:retrieve] confs: [logging]

[ivy:retrieve] validate = true

[ivy:retrieve] refresh = false

[ivy:retrieve] resolving dependencies for configuration 'logging'

[ivy:retrieve] == resolving dependencies for org.apache.solr#
example;work...@hgsolr2devmstr.healthgrades.com [logging]

[ivy:retrieve] == resolving dependencies
org.apache.solr#example;work...@hgsolr2devmstr.healthgrades.com->log4j#log4j;1.2.17
[logging->master]

[ivy:retrieve] default: Checking cache for: dependency: log4j#log4j;1.2.17
{logging=[master]}

[ivy:retrieve] don't use cache for log4j#log4j;1.2.17: checkModified=true

[ivy:retrieve] tried /home/solr/.ivy2/local/log4j/log4j/1.2.17/ivys/ivy.xml

[ivy:retrieve] tried
/home/solr/.ivy2/local/log4j/log4j/1.2.17/jars/log4j.jar

[ivy:retrieve] local: no ivy file nor artifact found for log4j#log4j;1.2.17

[ivy:retrieve] main: Checking cache for: dependency: log4j#log4j;1.2.17
{logging=[master]}

[ivy:retrieve] main: module revision found in cache: log4j#log4j;1.2.17

[ivy:retrieve] found log4j#log4j;1.2.17 in public

[ivy:retrieve] == resolving dependencies
org.apache.solr#example;work...@hgsolr2devmstr.healthgrades.com->org.slf4j#slf4j-api;1.7.7
[logging->master]

[ivy:retrieve] default: Checking cache for: dependency:
org.slf4j#slf4j-api;1.7.7 {logging=[master]}

[ivy:retrieve] don't use cache for org.slf4j#slf4j-api;1.7.7:
checkModified=true

[ivy:retrieve] tried
/home/solr/.ivy2/local/org.slf4j/slf4j-api/1.7.7/ivys/ivy.xml

[ivy:retrieve] tried
/home/solr/.ivy2/local/org.slf4j/slf4j-api/1.7.7/jars/slf4j-api.jar

[ivy:retrieve] local: no ivy file nor artifact found for
org.slf4j#slf4j-api;1.7.7

[ivy:retrieve] main: Checking cache for: dependency:
org.slf4j#slf4j-api;1.7.7 {logging=[master]}

[ivy:retrieve] main: module revision found in cache:
org.slf4j#slf4j-api;1.7.7

[ivy:retrieve] found org.slf4j#slf4j-api;1.7.7 in public

[ivy:retrieve] == resolving dependencies
org.apache.solr#example;work...@hgsolr2devmstr.healthgrades.com->org.slf4j#jcl-over-slf4j;1.7.7
[logging->master]

[ivy:retrieve] default: Checking cache for: dependency:
org.slf4j#jcl-over-slf4j;1.7.7 {logging=[master]}

[ivy:retrieve] don't use cache for org.slf4j#jcl-over-slf4j;1.7.7:
checkModified=true

-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: Does docValues impact termfreq ?

2015-10-24 Thread Aki Balogh
Thanks, let me think about that.

We're using termfreq to get the TF score, but we don't know in advance which
term we'll need the TF for, so we'd have to do a corpus-wide summing of
termfreq for each potential term across all documents. It seems like it'd
require some development work to compute that, and our code would be fragile.

Let me think about that more.

It might make sense to just move to SolrCloud; it's the right architectural
decision anyway.



Re: ant clean test failing

2015-10-24 Thread William Bell
OK, I deleted /home/solr/.ivy2 and the build started working.

>
> --
> Bill Bell
> billnb...@gmail.com
> cell 720-256-8076
>



-- 
Bill Bell
billnb...@gmail.com
cell 720-256-8076


Re: Order of actions in Update request

2015-10-24 Thread Shawn Heisey
On 10/24/2015 5:21 AM, Jamie Johnson wrote:
> Looking at the code and jira I see that ordering actions in solrj update
> request is currently not supported but I'd like to know if there is any
> other way to get this capability.  I took a quick look at the XML loader
> and it appears to process actions as it sees them so if the order was
> changed to order the actions as
> 
> Add
> Delete
> Add
> 
> Vs
> Add
> Add
> Delete
> 
> Would this cause any issues with the update?  Would it achieve the desired
> result?  Are there any other options for ordering actions as they were
> provided to the update request?

If those three actions are in separate update requests using
HttpSolrClient or CloudSolrClient in a single thread, I would expect
them to be executed in the order you make the requests.  If you're using
multiple threads, then you probably cannot guarantee the order of the
requests.

Are you using one of those clients in a single thread and seeing
something other than what I have described?  If so, I think that might
be a bug.

If you're using ConcurrentUpdateSolrClient, I don't think you can
guarantee order.  That client has multiple threads pulling the requests
out of an internal queue.  If some requests complete substantially
faster than others, they could happen out of order.  The concurrent
client is a poor choice for anything but bulk inserts, and because of
the fact that it ignores almost every error that happens while it runs,
it often is not a good choice for that either.

Thanks,
Shawn
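
For a single request body, Jamie's reading of the XML loader matches the raw XML update format: commands inside one update message appear to be processed in document order. A sketch of such an interleaved message (field names here are purely illustrative):

```xml
<update>
  <!-- the XML loader appears to process these commands top to bottom -->
  <add>
    <doc>
      <field name="id">doc1</field>
      <field name="title">first version</field>
    </doc>
  </add>
  <delete><id>doc1</id></delete>
  <add>
    <doc>
      <field name="id">doc1</field>
      <field name="title">final version</field>
    </doc>
  </add>
</update>
```

If the loader does honor document order, the net effect of the message above would be a single "final version" document for id doc1.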



Re: Does docValues impact termfreq ?

2015-10-24 Thread Upayavira
Can you explain more what you are using TF for? Because it sounds rather
like scoring. You could disable field norms and IDF and scoring would be
mostly TF, no?

Upayavira

On Sat, Oct 24, 2015, at 07:28 PM, Aki Balogh wrote:
> Thanks, let me think about that.
> 
> We're using termfreq to get the TF score, but we don't know which term
> we'll need the TF for. So we'd have to do a corpuswide summing of
> termfreq
> for each potential term across all documents in the corpus. It seems like
> it'd require some development work to compute that, and our code would be
> fragile.
> 
> Let me think about that more.
> 
> It might make sense to just move to solrcloud, it's the right
> architectural
> decision anyway.
> 
> 
> On Sat, Oct 24, 2015 at 1:54 PM, Upayavira  wrote:
> 
> > If you just want word length, then do work during indexing - index a
> > field for the word length. Then, I believe you can do faceting - e.g.
> > with the json faceting API I believe you can do a sum() calculation on a
> > field rather than the more traditional count.
> >
> > Thinking aloud, there might be an easier way - index a field that is the
> > same for all documents, and facet on it. Instead of counting the number
> > of documents, calculate the sum() of your word count field.
> >
> > I *think* that should work.
> >
> > Upayavira
> >
> > On Sat, Oct 24, 2015, at 04:24 PM, Aki Balogh wrote:
> > > Hi Jack,
> > >
> > > I'm just using solr to get word count across a large number of documents.
> > >
> > > It's somewhat non-standard, because we're ignoring relevance, but it
> > > seems
> > > to work well for this use case otherwise.
> > >
> > > My understanding then is:
> > > 1) since termfreq is pre-processed and fetched, there's no good way to
> > > speed it up (except by caching earlier calculations)
> > >
> > > 2) there's no way to have solr sum up all of the termfreqs across all
> > > documents in a search and just return one number for total termfreqs
> > >
> > >
> > > Are these correct?
> > >
> > > Thanks,
> > > Aki
> > >
> > >
> > > On Sat, Oct 24, 2015 at 11:20 AM, Jack Krupansky
> > > 
> > > wrote:
> > >
> > > > That's what a normal query does - Lucene takes all the terms used in
> > the
> > > > query and sums them up for each document in the response, producing a
> > > > single number, the score, for each document. That's the way Solr is
> > > > designed to be used. You still haven't elaborated why you are trying
> > to use
> > > > Solr in a way other than it was intended.
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > On Sat, Oct 24, 2015 at 11:13 AM, Aki Balogh 
> > wrote:
> > > >
> > > > > Gotcha - that's disheartening.
> > > > >
> > > > > One idea: when I run termfreq, I get all of the termfreqs for each
> > > > document
> > > > > one-by-one.
> > > > >
> > > > > Is there a way to have solr sum it up before creating the request,
> > so I
> > > > > only receive one number in the response?
> > > > >
> > > > >
> > > > > On Sat, Oct 24, 2015 at 11:05 AM, Upayavira  wrote:
> > > > >
> > > > > > If you mean using the term frequency function query, then I'm not
> > sure
> > > > > > there's a huge amount you can do to improve performance.
> > > > > >
> > > > > > The term frequency is a number that is used often, so it is stored
> > in
> > > > > > the index pre-calculated. Perhaps, if your data is not changing,
> > > > > > optimising your index would reduce it to one segment, and thus
> > might
> > > > > > ever so slightly speed the aggregation of term frequencies, but I
> > doubt
> > > > > > it'd make enough difference to make it worth doing.
> > > > > >
> > > > > > Upayavira
> > > > > >
> > > > > > On Sat, Oct 24, 2015, at 03:37 PM, Aki Balogh wrote:
> > > > > > > Thanks, Jack. I did some more research and found similar results.
> > > > > > >
> > > > > > > In our application, we are making multiple (think: 50) concurrent
> > > > > > > requests
> > > > > > > to calculate term frequency on a set of documents in
> > "real-time". The
> > > > > > > faster that results return, the better.
> > > > > > >
> > > > > > > Most of these requests are unique, so cache only helps slightly.
> > > > > > >
> > > > > > > This analysis is happening on a single solr instance.
> > > > > > >
> > > > > > > Other than moving to solr cloud and splitting out the processing
> > onto
> > > > > > > multiple servers, do you have any suggestions for what might
> > speed up
> > > > > > > termfreq at query time?
> > > > > > >
> > > > > > > Thanks,
> > > > > > > Aki
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Oct 23, 2015 at 7:21 PM, Jack Krupansky
> > > > > > > 
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Term frequency applies only to the indexed terms of a tokenized
> > > > > field.
> > > > > > > > DocValues is really just a copy of the original source text
> > and is
> > > > > not
> > > > > > > > tokenized into terms.
> > > > > > > >
> > > > > > > > Maybe you could explain how exactly you are using term
> > frequency in
> > > > > > > > function 

Re: Does docValues impact termfreq ?

2015-10-24 Thread Aki Balogh
Yes, sorry, I am not being clear.

We are not even doing scoring, just getting the raw TF values. We're doing
this in solr because it can scale well.

But with large corpora, retrieving the word counts takes some time, in part
because solr is splitting up word count by document and generating a large
request. We then get the request and just sum it all up. I'm wondering if
there's a more direct way.
On Oct 24, 2015 4:00 PM, "Upayavira"  wrote:

> Can you explain more what you are using TF for? Because it sounds rather
> like scoring. You could disable field norms and IDF and scoring would be
> mostly TF, no?
>
> Upayavira
>
> On Sat, Oct 24, 2015, at 07:28 PM, Aki Balogh wrote:
> > Thanks, let me think about that.
> >
> > We're using termfreq to get the TF score, but we don't know which term
> > we'll need the TF for. So we'd have to do a corpuswide summing of
> > termfreq
> > for each potential term across all documents in the corpus. It seems like
> > it'd require some development work to compute that, and our code would be
> > fragile.
> >
> > Let me think about that more.
> >
> > It might make sense to just move to solrcloud, it's the right
> > architectural
> > decision anyway.
> >
> >
> > On Sat, Oct 24, 2015 at 1:54 PM, Upayavira  wrote:
> >
> > > If you just want word length, then do work during indexing - index a
> > > field for the word length. Then, I believe you can do faceting - e.g.
> > > with the json faceting API I believe you can do a sum() calculation on
> a
> > > field rather than the more traditional count.
> > >
> > > Thinking aloud, there might be an easier way - index a field that is
> the
> > > same for all documents, and facet on it. Instead of counting the number
> > > of documents, calculate the sum() of your word count field.
> > >
> > > I *think* that should work.
> > >
> > > Upayavira
> > >
> > > On Sat, Oct 24, 2015, at 04:24 PM, Aki Balogh wrote:
> > > > Hi Jack,
> > > >
> > > > I'm just using solr to get word count across a large number of
> documents.
> > > >
> > > > It's somewhat non-standard, because we're ignoring relevance, but it
> > > > seems
> > > > to work well for this use case otherwise.
> > > >
> > > > My understanding then is:
> > > > 1) since termfreq is pre-processed and fetched, there's no good way
> to
> > > > speed it up (except by caching earlier calculations)
> > > >
> > > > 2) there's no way to have solr sum up all of the termfreqs across all
> > > > documents in a search and just return one number for total termfreqs
> > > >
> > > >
> > > > Are these correct?
> > > >
> > > > Thanks,
> > > > Aki
> > > >
> > > >
> > > > On Sat, Oct 24, 2015 at 11:20 AM, Jack Krupansky
> > > > 
> > > > wrote:
> > > >
> > > > > That's what a normal query does - Lucene takes all the terms used
> in
> > > the
> > > > > query and sums them up for each document in the response,
> producing a
> > > > > single number, the score, for each document. That's the way Solr is
> > > > > designed to be used. You still haven't elaborated why you are
> trying
> > > to use
> > > > > Solr in a way other than it was intended.
> > > > >
> > > > > -- Jack Krupansky
> > > > >
> > > > > On Sat, Oct 24, 2015 at 11:13 AM, Aki Balogh 
> > > wrote:
> > > > >
> > > > > > Gotcha - that's disheartening.
> > > > > >
> > > > > > One idea: when I run termfreq, I get all of the termfreqs for
> each
> > > > > document
> > > > > > one-by-one.
> > > > > >
> > > > > > Is there a way to have solr sum it up before creating the
> request,
> > > so I
> > > > > > only receive one number in the response?
> > > > > >
> > > > > >
> > > > > > On Sat, Oct 24, 2015 at 11:05 AM, Upayavira 
> wrote:
> > > > > >
> > > > > > > If you mean using the term frequency function query, then I'm
> not
> > > sure
> > > > > > > there's a huge amount you can do to improve performance.
> > > > > > >
> > > > > > > The term frequency is a number that is used often, so it is
> stored
> > > in
> > > > > > > the index pre-calculated. Perhaps, if your data is not
> changing,
> > > > > > > optimising your index would reduce it to one segment, and thus
> > > might
> > > > > > > ever so slightly speed the aggregation of term frequencies,
> but I
> > > doubt
> > > > > > > it'd make enough difference to make it worth doing.
> > > > > > >
> > > > > > > Upayavira
> > > > > > >
> > > > > > > On Sat, Oct 24, 2015, at 03:37 PM, Aki Balogh wrote:
> > > > > > > > Thanks, Jack. I did some more research and found similar
> results.
> > > > > > > >
> > > > > > > > In our application, we are making multiple (think: 50)
> concurrent
> > > > > > > > requests
> > > > > > > > to calculate term frequency on a set of documents in
> > > "real-time". The
> > > > > > > > faster that results return, the better.
> > > > > > > >
> > > > > > > > Most of these requests are unique, so cache only helps
> slightly.
> > > > > > > >
> > > > > > > > This analysis is happening on a single solr instance.
> > > > > > > >
> > > > > > > > Other than moving to solr cloud and splittin

Re: Does docValues impact termfreq ?

2015-10-24 Thread Upayavira
yes, but what do you want to do with the TF? What problem are you
solving with it? If you are able to share that...

On Sat, Oct 24, 2015, at 09:05 PM, Aki Balogh wrote:
> Yes, sorry, I am not being clear.
> 
> We are not even doing scoring, just getting the raw TF values. We're
> doing
> this in solr because it can scale well.
> 
> But with large corpora, retrieving the word counts takes some time, in
> part
> because solr is splitting up word count by document and generating a
> large
> request. We then get the request and just sum it all up. I'm wondering if
> there's a more direct way.
> On Oct 24, 2015 4:00 PM, "Upayavira"  wrote:
> 
> > Can you explain more what you are using TF for? Because it sounds rather
> > like scoring. You could disable field norms and IDF and scoring would be
> > mostly TF, no?
> >
> > Upayavira
> >
> > On Sat, Oct 24, 2015, at 07:28 PM, Aki Balogh wrote:
> > > Thanks, let me think about that.
> > >
> > > We're using termfreq to get the TF score, but we don't know which term
> > > we'll need the TF for. So we'd have to do a corpuswide summing of
> > > termfreq
> > > for each potential term across all documents in the corpus. It seems like
> > > it'd require some development work to compute that, and our code would be
> > > fragile.
> > >
> > > Let me think about that more.
> > >
> > > It might make sense to just move to solrcloud, it's the right
> > > architectural
> > > decision anyway.
> > >
> > >
> > > On Sat, Oct 24, 2015 at 1:54 PM, Upayavira  wrote:
> > >
> > > > If you just want word length, then do work during indexing - index a
> > > > field for the word length. Then, I believe you can do faceting - e.g.
> > > > with the json faceting API I believe you can do a sum() calculation on
> > a
> > > > field rather than the more traditional count.
> > > >
> > > > Thinking aloud, there might be an easier way - index a field that is
> > the
> > > > same for all documents, and facet on it. Instead of counting the number
> > > > of documents, calculate the sum() of your word count field.
> > > >
> > > > I *think* that should work.
> > > >
> > > > Upayavira
> > > >
> > > > On Sat, Oct 24, 2015, at 04:24 PM, Aki Balogh wrote:
> > > > > Hi Jack,
> > > > >
> > > > > I'm just using solr to get word count across a large number of
> > documents.
> > > > >
> > > > > It's somewhat non-standard, because we're ignoring relevance, but it
> > > > > seems
> > > > > to work well for this use case otherwise.
> > > > >
> > > > > My understanding then is:
> > > > > 1) since termfreq is pre-processed and fetched, there's no good way
> > to
> > > > > speed it up (except by caching earlier calculations)
> > > > >
> > > > > 2) there's no way to have solr sum up all of the termfreqs across all
> > > > > documents in a search and just return one number for total termfreqs
> > > > >
> > > > >
> > > > > Are these correct?
> > > > >
> > > > > Thanks,
> > > > > Aki
> > > > >
> > > > >
> > > > > On Sat, Oct 24, 2015 at 11:20 AM, Jack Krupansky
> > > > > 
> > > > > wrote:
> > > > >
> > > > > > That's what a normal query does - Lucene takes all the terms used
> > in
> > > > the
> > > > > > query and sums them up for each document in the response,
> > producing a
> > > > > > single number, the score, for each document. That's the way Solr is
> > > > > > designed to be used. You still haven't elaborated why you are
> > trying
> > > > to use
> > > > > > Solr in a way other than it was intended.
> > > > > >
> > > > > > -- Jack Krupansky
> > > > > >
> > > > > > On Sat, Oct 24, 2015 at 11:13 AM, Aki Balogh 
> > > > wrote:
> > > > > >
> > > > > > > Gotcha - that's disheartening.
> > > > > > >
> > > > > > > One idea: when I run termfreq, I get all of the termfreqs for
> > each
> > > > > > document
> > > > > > > one-by-one.
> > > > > > >
> > > > > > > Is there a way to have solr sum it up before creating the
> > request,
> > > > so I
> > > > > > > only receive one number in the response?
> > > > > > >
> > > > > > >
> > > > > > > On Sat, Oct 24, 2015 at 11:05 AM, Upayavira 
> > wrote:
> > > > > > >
> > > > > > > > If you mean using the term frequency function query, then I'm
> > not
> > > > sure
> > > > > > > > there's a huge amount you can do to improve performance.
> > > > > > > >
> > > > > > > > The term frequency is a number that is used often, so it is
> > stored
> > > > in
> > > > > > > > the index pre-calculated. Perhaps, if your data is not
> > changing,
> > > > > > > > optimising your index would reduce it to one segment, and thus
> > > > might
> > > > > > > > ever so slightly speed the aggregation of term frequencies,
> > but I
> > > > doubt
> > > > > > > > it'd make enough difference to make it worth doing.
> > > > > > > >
> > > > > > > > Upayavira
> > > > > > > >
> > > > > > > > On Sat, Oct 24, 2015, at 03:37 PM, Aki Balogh wrote:
> > > > > > > > > Thanks, Jack. I did some more research and found similar
> > results.
> > > > > > > > >
> > > > > > > > > In our application, we are making multiple (think: 50)

Re: Does docValues impact termfreq ?

2015-10-24 Thread Aki Balogh
Certainly, yes. I'm just doing a word count, i.e. how often does a specific
term come up in the corpus?
On Oct 24, 2015 4:20 PM, "Upayavira"  wrote:

> yes, but what do you want to do with the TF? What problem are you
> solving with it? If you are able to share that...
>
> On Sat, Oct 24, 2015, at 09:05 PM, Aki Balogh wrote:
> > Yes, sorry, I am not being clear.
> >
> > We are not even doing scoring, just getting the raw TF values. We're
> > doing
> > this in solr because it can scale well.
> >
> > But with large corpora, retrieving the word counts takes some time, in
> > part
> > because solr is splitting up word count by document and generating a
> > large
> > request. We then get the request and just sum it all up. I'm wondering if
> > there's a more direct way.
> > On Oct 24, 2015 4:00 PM, "Upayavira"  wrote:
> >
> > > Can you explain more what you are using TF for? Because it sounds
> rather
> > > like scoring. You could disable field norms and IDF and scoring would
> be
> > > mostly TF, no?
> > >
> > > Upayavira
> > >
> > > On Sat, Oct 24, 2015, at 07:28 PM, Aki Balogh wrote:
> > > > Thanks, let me think about that.
> > > >
> > > > We're using termfreq to get the TF score, but we don't know which
> term
> > > > we'll need the TF for. So we'd have to do a corpuswide summing of
> > > > termfreq
> > > > for each potential term across all documents in the corpus. It seems
> like
> > > > it'd require some development work to compute that, and our code
> would be
> > > > fragile.
> > > >
> > > > Let me think about that more.
> > > >
> > > > It might make sense to just move to solrcloud, it's the right
> > > > architectural
> > > > decision anyway.
> > > >
> > > >
> > > > On Sat, Oct 24, 2015 at 1:54 PM, Upayavira  wrote:
> > > >
> > > > > If you just want word length, then do work during indexing - index
> a
> > > > > field for the word length. Then, I believe you can do faceting -
> e.g.
> > > > > with the json faceting API I believe you can do a sum()
> calculation on
> > > a
> > > > > field rather than the more traditional count.
> > > > >
> > > > > Thinking aloud, there might be an easier way - index a field that
> is
> > > the
> > > > > same for all documents, and facet on it. Instead of counting the
> number
> > > > > of documents, calculate the sum() of your word count field.
> > > > >
> > > > > I *think* that should work.
> > > > >
> > > > > Upayavira
> > > > >
> > > > > On Sat, Oct 24, 2015, at 04:24 PM, Aki Balogh wrote:
> > > > > > Hi Jack,
> > > > > >
> > > > > > I'm just using solr to get word count across a large number of
> > > documents.
> > > > > >
> > > > > > It's somewhat non-standard, because we're ignoring relevance,
> but it
> > > > > > seems
> > > > > > to work well for this use case otherwise.
> > > > > >
> > > > > > My understanding then is:
> > > > > > 1) since termfreq is pre-processed and fetched, there's no good
> way
> > > to
> > > > > > speed it up (except by caching earlier calculations)
> > > > > >
> > > > > > 2) there's no way to have solr sum up all of the termfreqs
> across all
> > > > > > documents in a search and just return one number for total
> termfreqs
> > > > > >
> > > > > >
> > > > > > Are these correct?
> > > > > >
> > > > > > Thanks,
> > > > > > Aki
> > > > > >
> > > > > >
> > > > > > On Sat, Oct 24, 2015 at 11:20 AM, Jack Krupansky
> > > > > > 
> > > > > > wrote:
> > > > > >
> > > > > > > That's what a normal query does - Lucene takes all the terms
> used
> > > in
> > > > > the
> > > > > > > query and sums them up for each document in the response,
> > > producing a
> > > > > > > single number, the score, for each document. That's the way
> Solr is
> > > > > > > designed to be used. You still haven't elaborated why you are
> > > trying
> > > > > to use
> > > > > > > Solr in a way other than it was intended.
> > > > > > >
> > > > > > > -- Jack Krupansky
> > > > > > >
> > > > > > > On Sat, Oct 24, 2015 at 11:13 AM, Aki Balogh <
> a...@marketmuse.com>
> > > > > wrote:
> > > > > > >
> > > > > > > > Gotcha - that's disheartening.
> > > > > > > >
> > > > > > > > One idea: when I run termfreq, I get all of the termfreqs for
> > > each
> > > > > > > document
> > > > > > > > one-by-one.
> > > > > > > >
> > > > > > > > Is there a way to have solr sum it up before creating the
> > > request,
> > > > > so I
> > > > > > > > only receive one number in the response?
> > > > > > > >
> > > > > > > >
> > > > > > > > On Sat, Oct 24, 2015 at 11:05 AM, Upayavira 
> > > wrote:
> > > > > > > >
> > > > > > > > > If you mean using the term frequency function query, then
> I'm
> > > not
> > > > > sure
> > > > > > > > > there's a huge amount you can do to improve performance.
> > > > > > > > >
> > > > > > > > > The term frequency is a number that is used often, so it is
> > > stored
> > > > > in
> > > > > > > > > the index pre-calculated. Perhaps, if your data is not
> > > changing,
> > > > > > > > > optimising your index would reduce it to one segment, and
> thus
> > > > > might
> > > >
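
Upayavira's json faceting suggestion — index a per-document word count and let Solr sum it — could be expressed as a request along these lines (a sketch only; word_count_i is an assumed field populated at index time):

```json
{
  "query": "*:*",
  "limit": 0,
  "facet": {
    "total_words": "sum(word_count_i)"
  }
}
```

POSTed as json to a select handler, a request like this would return one aggregate number rather than per-document values, avoiding the large per-document response described above.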

Re: any clean test failing

2015-10-24 Thread Erick Erickson
I've been seeing this happen a lot lately, it seems like a series of lock
files are left around under some conditions. I've also incorporated some
of Mark Miller's suggestions, but perhaps one of my upgrades undid
that work.

I've found it much less painful to remove all the *.lck files; I don't
have to wait for a _long_ time to get all the files back. I use

find . -name "*.lck" | xargs rm

there are other ways too.

On Sat, Oct 24, 2015 at 12:12 PM, William Bell  wrote:
> OK I deleted /home/solr/.ivy2 and it started working.
>
> On Sat, Oct 24, 2015 at 11:57 AM, William Bell  wrote:
>
>> It is getting stuck on resolve.
>>
>> ant clean test
>>
>> SOLR 5.3.1
>>
>> [ivy:retrieve] retrieve done (5ms)
>>
>> Overriding previous definition of property "ivy.version"
>>
>> [ivy:retrieve] no resolved descriptor found: launching default resolve
>>
>> Overriding previous definition of property "ivy.version"
>>
>> [ivy:retrieve] using ivy parser to parse
>> file:/home/solr/src/lucene_solr_5_3_1A/solr/server/ivy.xml
>>
>> [ivy:retrieve] :: resolving dependencies :: org.apache.solr#
>> example;work...@hgsolr2devmstr.healthgrades.com
>>
>> [ivy:retrieve] confs: [logging]
>>
>> [ivy:retrieve] validate = true
>>
>> [ivy:retrieve] refresh = false
>>
>> [ivy:retrieve] resolving dependencies for configuration 'logging'
>>
>> [ivy:retrieve] == resolving dependencies for org.apache.solr#
>> example;work...@hgsolr2devmstr.healthgrades.com [logging]
>>
>> [ivy:retrieve] == resolving dependencies
>> org.apache.solr#example;work...@hgsolr2devmstr.healthgrades.com->log4j#log4j;1.2.17
>> [logging->master]
>>
>> [ivy:retrieve] default: Checking cache for: dependency: log4j#log4j;1.2.17
>> {logging=[master]}
>>
>> [ivy:retrieve] don't use cache for log4j#log4j;1.2.17: checkModified=true
>>
>> [ivy:retrieve] tried
>> /home/solr/.ivy2/local/log4j/log4j/1.2.17/ivys/ivy.xml
>>
>> [ivy:retrieve] tried
>> /home/solr/.ivy2/local/log4j/log4j/1.2.17/jars/log4j.jar
>>
>> [ivy:retrieve] local: no ivy file nor artifact found for
>> log4j#log4j;1.2.17
>>
>> [ivy:retrieve] main: Checking cache for: dependency: log4j#log4j;1.2.17
>> {logging=[master]}
>>
>> [ivy:retrieve] main: module revision found in cache: log4j#log4j;1.2.17
>>
>> [ivy:retrieve] found log4j#log4j;1.2.17 in public
>>
>> [ivy:retrieve] == resolving dependencies
>> org.apache.solr#example;work...@hgsolr2devmstr.healthgrades.com->org.slf4j#slf4j-api;1.7.7
>> [logging->master]
>>
>> [ivy:retrieve] default: Checking cache for: dependency:
>> org.slf4j#slf4j-api;1.7.7 {logging=[master]}
>>
>> [ivy:retrieve] don't use cache for org.slf4j#slf4j-api;1.7.7:
>> checkModified=true
>>
>> [ivy:retrieve] tried
>> /home/solr/.ivy2/local/org.slf4j/slf4j-api/1.7.7/ivys/ivy.xml
>>
>> [ivy:retrieve] tried
>> /home/solr/.ivy2/local/org.slf4j/slf4j-api/1.7.7/jars/slf4j-api.jar
>>
>> [ivy:retrieve] local: no ivy file nor artifact found for
>> org.slf4j#slf4j-api;1.7.7
>>
>> [ivy:retrieve] main: Checking cache for: dependency:
>> org.slf4j#slf4j-api;1.7.7 {logging=[master]}
>>
>> [ivy:retrieve] main: module revision found in cache:
>> org.slf4j#slf4j-api;1.7.7
>>
>> [ivy:retrieve] found org.slf4j#slf4j-api;1.7.7 in public
>>
>> [ivy:retrieve] == resolving dependencies
>> org.apache.solr#example;work...@hgsolr2devmstr.healthgrades.com->org.slf4j#jcl-over-slf4j;1.7.7
>> [logging->master]
>>
>> [ivy:retrieve] default: Checking cache for: dependency:
>> org.slf4j#jcl-over-slf4j;1.7.7 {logging=[master]}
>>
>> [ivy:retrieve] don't use cache for org.slf4j#jcl-over-slf4j;1.7.7:
>> checkModified=true
>>
>> --
>> Bill Bell
>> billnb...@gmail.com
>> cell 720-256-8076
>>
>
>
>
> --
> Bill Bell
> billnb...@gmail.com
> cell 720-256-8076


RE: DIH Caching with Delta Import

2015-10-24 Thread Todd Long
Dyer, James-2 wrote
> The DIH Cache feature does not work with delta import.  Actually, much of
> DIH does not work with delta import.  The workaround you describe is
> similar to the approach described here:
> https://wiki.apache.org/solr/DataImportHandlerDeltaQueryViaFullImport ,
> which in my opinion is the best way to implement partial updates with DIH.

Not what I was hoping to hear but at least that explains the delta import
funkyness we were experiencing. Thank you for providing the partial updates
implementation link.



--
View this message in context: 
http://lucene.472066.n3.nabble.com/DIH-Caching-with-Delta-Import-tp4235598p4236384.html
Sent from the Solr - User mailing list archive at Nabble.com.
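
For reference, the delta-via-full-import approach from the wiki page linked above is typically wired up along these lines (a sketch; the data source, table, and column names are made up). Running it with command=full-import&clean=false re-indexes only the rows matching the "changed since" filter:

```xml
<dataConfig>
  <dataSource driver="org.postgresql.Driver"
              url="jdbc:postgresql://localhost/db"/>
  <document>
    <!-- ${dataimporter.last_index_time} is substituted by DIH at run time -->
    <entity name="item"
            query="SELECT id, title FROM item
                   WHERE last_modified &gt; '${dataimporter.last_index_time}'">
      <field column="id" name="id"/>
      <field column="title" name="title"/>
    </entity>
  </document>
</dataConfig>
```

The wiki page also shows guarding the WHERE clause with ${dataimporter.request.clean} so the same config can serve both a full rebuild and a partial update.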


Using the ExtractRequestHandler

2015-10-24 Thread Salonee Rege
Hi,
   We are using Solr and need help using the ExtractRequestHandler, as
we cannot decide what input parameters we need to specify. Kindly help.
*Salonee Rege*
USC Viterbi School of Engineering
University of Southern California
Master of Computer Science - Student
Computer Science - B.E
salon...@usc.edu  *||* *619-709-6756*


Re: EdgeNGramFilterFactory for Chinese characters

2015-10-24 Thread Tomoko Uchida
> I have rich-text documents that are in both English and Chinese, and
> currently I have EdgeNGramFilterFactory enabled during indexing, as I need
> it for partial matching for English words. But this means it will also
> break up each of the Chinese characters into different tokens.

EdgeNGramFilterFactory creates sub-strings (prefixes) from each token. Its
behavior is independent of language.
If you need to perform partial (prefix) match for **only English words**,
you can create a separate field that keeps only English words (I've never
tried that, but might be possible by PatternTokenizerFactory or other
tokenizer/filter chains...,) and apply EdgeNGramFilterFactory to the field.

Hope it helps,
Tomoko

2015-10-23 13:04 GMT+09:00 Zheng Lin Edwin Yeo :

> Hi,
>
> Would like to check, is it good to use EdgeNGramFilterFactory for indexes
> that contains Chinese characters?
> Will it affect the accuracy of the search for Chinese words?
>
> I have rich-text documents that are in both English and Chinese, and
> currently I have EdgeNGramFilterFactory enabled during indexing, as I need
> it for partial matching for English words. But this means it will also
> break up each of the Chinese characters into different tokens.
>
> I'm using the HMMChineseTokenizerFactory for my tokenizer.
>
> Thank you.
>
> Regards,
> Edwin
>
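
One way to realize the separate English-only field Tomoko describes — again only an untested sketch — is a field type whose tokenizer keeps only runs of Latin letters and digits (so CJK characters are dropped) before edge n-gramming at index time:

```xml
<fieldType name="text_en_prefix" class="solr.TextField">
  <analyzer type="index">
    <!-- group="0" emits each regex match as a token; CJK text produces none -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[A-Za-z0-9]+" group="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- no n-gramming at query time: the typed prefix matches indexed grams -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[A-Za-z0-9]+" group="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

A copyField from the main content field into a field of this type would then give English prefix matching while the HMMChineseTokenizerFactory field keeps handling the Chinese text untouched.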