Re: Scoring, payloads and phrase queries

2015-07-25 Thread Mikhail Khludnev
Would PayloadNearQuery suit it?

On Fri, Jul 24, 2015 at 5:41 PM, Jamie Johnson wrote:

> Is there a way to take payloads into account when scoring phrase queries,
> as PayloadTermQuery does for term queries?
>



-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics





Re: Scoring, payloads and phrase queries

2015-07-25 Thread Jamie Johnson
Thanks, Mikhail! I had seen this class but originally thought it wouldn't be
usable; I now think I was wrong.  I have an example that rewrites a phrase
query as span queries and then uses PayloadNearQuery, which seems to work
correctly.  I have done something similar for MultiPhraseQuery, though I am
not sure it is right yet because I don't understand how that class uses
positions.  My first cut is shown below (PF is just a PayloadFunction and
not of much interest).  Does this look correct?

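import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.MultiPhraseQuery;
import org.apache.lucene.search.payloads.PayloadNearQuery;
import org.apache.lucene.search.spans.SpanOrQuery;
import org.apache.lucene.search.spans.SpanQuery;
import org.apache.lucene.search.spans.SpanTermQuery;

MultiPhraseQuery phrase = (MultiPhraseQuery) query;
// One Term[] per phrase position; each array holds the alternatives at that position.
List<Term[]> terms = phrase.getTermArrays();
SpanQuery[] topLevelSpans = new SpanQuery[terms.size()];
for (int j = 0; j < terms.size(); j++) {
    Term[] internalTerms = terms.get(j);
    SpanQuery[] sq = new SpanQuery[internalTerms.length];
    for (int i = 0; i < internalTerms.length; i++) {
        sq[i] = new SpanTermQuery(internalTerms[i]);
    }
    // Any one of the alternatives may match at this position.
    topLevelSpans[j] = new SpanOrQuery(sq);
}
// Note: the values from phrase.getPositions() are not used here.
PayloadNearQuery pnq = new PayloadNearQuery(topLevelSpans,
    phrase.getSlop(), true, new PF());
pnq.setBoost(phrase.getBoost());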


It looks like supporting payloads in all the query types I care about will
mean rewriting those queries (or their pieces) to a PayloadNearQuery or a
PayloadTermQuery.  Is there a PayloadMultiTermQuery that fuzzy, range,
wildcard and similar queries could be rewritten to?
Thanks again, I really appreciate the pointer.


On Jul 25, 2015 5:22 AM, "Mikhail Khludnev" wrote:

> Would PayloadNearQuery suit it?
>
> On Fri, Jul 24, 2015 at 5:41 PM, Jamie Johnson wrote:
>
> > Is there a way to take payloads into account when scoring phrase queries,
> > as PayloadTermQuery does for term queries?


Re: Unexpected docvalues type error using result grouping - Use UninvertingReader or index with docvalues

2015-07-25 Thread Erick Erickson
Simply put, changing the schema and then cutting corners by NOT reindexing
from scratch, trying to intuit what would be OK when you are _not_
completely familiar with the low-level details of Lucene, is a recipe
for problems, as you are finding out and as Shawn explained.

Think of it this way. The schema.xml is the theory, what's actually _in_
the segments is the reality. Lucene does not impose any uniformity
at all; Solr does, based on the schema file. But that's "by convention",
i.e. by creating Lucene fields in a predictable, uniform way. This means
that after a schema change, new segments can be written with wholly new
assumptions that aren't reconcilable with the old segments.

And the fact that you've deleted docs of type A and B means nothing. All
that really happened is that the docs were _marked_ as deleted. The
underlying segments still have the old data (and assumptions). So the
traces of the original definitions are in the segments files and are
possibly incompatible with the new docs written to new segments.
Like Shawn, I have no real clue whether even optimizing would make
any difference. So my take would be: don't go there.
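
To make the "marked as deleted" point concrete, here is a toy Java model.
It is only a sketch: SegmentToy, Segment and hasMixedDocValues are made-up
names for illustration, not Lucene classes, and the model says nothing
about what a merge or optimize does. It assumes just two things from the
explanation above: a segment keeps the field assumptions it was written
with, and a delete only flips a marker.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

final class SegmentToy {
    /** Toy segment: its field metadata is fixed at creation time (hypothetical model). */
    static final class Segment {
        final boolean fieldHasDocValues; // what the schema said when this segment was written
        final int docCount;
        final BitSet deleted = new BitSet(); // a delete only sets a bit here

        Segment(boolean fieldHasDocValues, int docCount) {
            this.fieldHasDocValues = fieldHasDocValues;
            this.docCount = docCount;
        }

        /** Marks a document as deleted; nothing else in the segment changes. */
        void delete(int docId) {
            deleted.set(docId);
        }

        int liveDocs() {
            return docCount - deleted.cardinality();
        }
    }

    /** True when some segments were written with docValues for the field and some without. */
    static boolean hasMixedDocValues(List<Segment> segments) {
        boolean sawWith = false;
        boolean sawWithout = false;
        for (Segment s : segments) {
            if (s.fieldHasDocValues) {
                sawWith = true;
            } else {
                sawWithout = true;
            }
        }
        return sawWith && sawWithout;
    }

    public static void main(String[] args) {
        List<Segment> index = new ArrayList<>();

        // Written under the old schema: no docValues on the field.
        Segment old = new Segment(false, 3);
        index.add(old);

        // Delete two of the three docs; the segment and its metadata stay put.
        old.delete(0);
        old.delete(1);

        // The two docs are reindexed under the new schema into a new segment.
        index.add(new Segment(true, 2));

        System.out.println("live docs in old segment: " + old.liveDocs());
        System.out.println("mixed assumptions: " + hasMixedDocValues(index));
    }
}
```

In this toy, deleting and reindexing some documents still leaves an old
segment whose assumptions differ from the new one, which is the situation
a full reindex avoids.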

This is one of those things that you really have to "just live with" with
Solr/Lucene.

Best,
Erick

On Fri, Jul 24, 2015 at 3:57 PM, Shawn Heisey wrote:
> On 7/24/2015 3:48 PM, shamik wrote:
>> Here's the part which I'm not able to understand. I have, for example,
>> sources A, B, C and D in the index. Each source contains "n" documents.
>> Out of these, a bunch of documents in A and B are tagged with MediaType.
>> I took the following steps:
>>
>> 1. Delete all documents tagged with MediaType for A and B. Documents from C
>> and D are not touched.
>>
>> 2. Re-Index documents which were tagged with MediaType
>>
>> 3. Run Optimization
>>
>> Still, I keep seeing this exception. Does this mean content from C and D
>> is impacted even though it is not tagged with MediaType?
>
> Do any docs from C and D have that field?  Never mind whether you need
> to run your operation on them ... do they have the field?  If so, then
> when the facet code (which knows about the schema and the fact that it
> has docValues) looks at those segments, they do not have *any* docValues
> tagging for that field.  This likely would cause big explosions.  This
> lack of docValues tagging probably survives an optimize.
>
> Even if they don't have the field, there may be something about the
> Lucene format that the docValues support just doesn't like when the
> original docs were indexed without docValues on that field.
>
> Rebuilding the *entire* index is recommended for most schema changes,
> especially those like docValues that affect very low-level code
> implementations.  Solr hides lots of low-level Lucene details from the
> administrator, but makes use of those details to do its job.  Making
> sure your config and schema match what was present when the index was
> built is sometimes critical.
>
> Thanks,
> Shawn
>


Re: term frequency with stemming

2015-07-25 Thread Aki Balogh
I believe I found a solution: use a third-party stemmer to stem the term
first, then pass it to termfreq.

The only trick is that each term in a phrase has to be stemmed separately
(e.g. "end-user experience" has to be broken down into "end-user" -> "end-us"
and "experience" -> "experi") before being passed, as in
termfreq(body, "end-us experi").

From what I can tell, FunctionQuery / termfreq doesn't have a way to apply
stemming.
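
A minimal Java sketch of the stem-each-term step, under stated assumptions:
the stemmer is passed in as a plain function, and the one used in main is a
lookup table containing only the two stems quoted above, standing in for
whatever third-party stemmer is actually used. StemmedTermFreq, stemPhrase
and termFreq are hypothetical names, not Solr APIs; the code only builds
the function-query string.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

final class StemmedTermFreq {
    /** Stems each whitespace-separated term on its own, then rejoins with single spaces. */
    static String stemPhrase(String phrase, UnaryOperator<String> stemmer) {
        return Arrays.stream(phrase.trim().split("\\s+"))
                .map(stemmer)
                .collect(Collectors.joining(" "));
    }

    /** Builds the termfreq(...) string for a field and an already-unstemmed phrase. */
    static String termFreq(String field, String phrase, UnaryOperator<String> stemmer) {
        return "termfreq(" + field + ", \"" + stemPhrase(phrase, stemmer) + "\")";
    }

    public static void main(String[] args) {
        // Stand-in stemmer: a lookup table, NOT a real Porter implementation.
        Map<String, String> stems = new HashMap<>();
        stems.put("end-user", "end-us");
        stems.put("experience", "experi");
        UnaryOperator<String> stemmer = term -> stems.getOrDefault(term, term);

        // Prints: termfreq(body, "end-us experi")
        System.out.println(termFreq("body", "end-user experience", stemmer));
    }
}
```

Swapping in a real stemmer only means supplying a different function; it
should be the same algorithm the index-time analyzer applies, otherwise the
stems will not line up with the indexed terms.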

Akos (Aki) Balogh
Co-Founder, MarketMuse
https://www.MarketMuse.com 


On Fri, Jul 24, 2015 at 12:04 PM, Aki Balogh wrote:

> Hi All,
>
> I'm using TermVectorComponent and stemming (Porter) in order to get term
> frequencies with fuzzy matching. I'm stemming at index and query time.
>
> Is there a way to get term frequency from the index?
> * termfreq doesn't support stemming or wildcards
> * terms component doesn't allow additional filters
> * I could use a copyfield to save a non-stemmed version at indexing, and
> run termfreq on that, but then I don't get any fuzzy matching
>
> Thanks,
> Aki
>


Re: serious JSON Facet bug

2015-07-25 Thread naga sharathrayapati
Yonik,

Did you see this issue with 5.2 as well or only 5.1?

Thanks,
Naga

On Fri, Jul 24, 2015 at 9:15 PM, Yonik Seeley wrote:

> On Fri, Jul 24, 2015 at 8:03 PM, Nagasharath wrote:
> > Is there a jira logged for this issue?
>
> * SOLR-7781: JSON Facet API: Terms facet on string/text fields with
>   sub-facets caused a bug that resulted in filter cache lookup misses
>   as well as the filter cache exceeding its configured size. (yonik)
>
> https://issues.apache.org/jira/browse/SOLR-7781
>
> -Yonik
>