Re: catchall fields or multiple fields

2015-10-12 Thread Trey Grainger
Elisabeth,

Yes, it will almost always be more efficient to search within a catch-all
field than to search across multiple fields. Think of it this way: when you
search on a single field, you are doing a single keyword search against the
index per term. When you search across multiple fields, you are executing
the search for that term multiple times (once for each field) against the
index, and then doing the necessary intersections/unions/etc. of the
document sets.
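For illustration, here is a minimal sketch of the two setups being compared
(field names below are placeholders, not from your schema). The catch-all
approach copies everything into one indexed field:

<field name="catchall" type="text_general" indexed="true" stored="false" multiValued="true"/>
<copyField source="title" dest="catchall"/>
<copyField source="description" dest="catchall"/>

q=barack obama&df=catchall

whereas the multi-field approach executes the term lookups against each
field at query time:

q=barack obama&defType=edismax&qf=title description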

As you continue to add more and more fields to search across, the search
continues to grow slower. If you're only searching a few fields then it
will probably not be noticeably slower, but the more and more you add, the
slower your response times will become. This slowdown may be measured in
milliseconds, in which case you may not care, but it will be slower.

The idf point you mentioned can be both a pro and a con depending upon the
use case. For example, if you are searching news content that has a
"french_text" field and an "english_text" field, it would be suboptimal if
for the search "Barack Obama" you got only French documents at the top
because the US president's name is much more commonly found in English
documents. When you're searching fields with different types of content,
however, you might find examples where you'd actually want idf differences
maintained and documents differentiated based upon underlying field.

One particularly nice thing about the multi-field approach is that it is
very easy to apply different boosts to the fields and to dynamically change
the boosts. You can similarly do this with payloads within a catch-all
field. You could even assign each term a payload corresponding to which
field the content came from, and then dynamically change the boosts
associated with those payloads at query time (caveat - custom code
required). See this blog post for an end-to-end payload scoring example,
https://lucidworks.com/blog/2014/06/13/end-to-end-payload-example-in-solr/.
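As a rough sketch of that idea (the field type below is illustrative, and the
query-time mapping of payload values to boosts still requires the custom code
mentioned above), you can attach a payload to each term at index time with the
delimited payload filter:

<fieldType name="payloads_catchall" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DelimitedPayloadTokenFilterFactory" delimiter="|" encoder="float"/>
  </analyzer>
</fieldType>

and then index content such as "senior|1.0 java|1.0 developer|2.0", where 1.0
marks terms that originated in the title field and 2.0 marks terms from the
description field.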


Sharing my personal experience: at CareerBuilder, we use the catch-all
field with payloads (one per underlying field) that we can dynamically
change the weight of at query time. We found that for most of our corpus
sizes (ranging between 2 and 100 million full-text jobs or resumes), it is
more efficient to search between 1 and 3 actual fields than to do the
payload-scored search within the catch-all field, but once we get to the 4th
field, the extra cost associated with the payload scoring is overtaken by the
additional time required to search each additional field. These numbers
(3 vs. 4 fields, etc.) are all anecdotal, of course, as they are dependent upon
a lot of environmental and corpus factors unique to our use case.

The main point of this approach, however, is that there is no additional
cost per-field beyond the upfront cost to add and score payloads, so we
have been able to easily represent over a hundred of these payload-based
"virtual fields" with different weights within a catch-all field (all with
a fixed query-time cost).

*In summary*: yes, you should expect a performance decline as you add more
and more fields to your query if you are searching across multiple fields.
You can overcome this by using a single catch-all field if you are okay with
losing per-field IDF (you'll still have it globally across all fields). If
you want to use a catch-all field, but still want to boost content based
upon the field it originated within, you can accomplish this with payloads.

All the best,

Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search & Recommendations @ CareerBuilder


On Mon, Oct 12, 2015 at 9:12 AM, Ahmet Arslan 
wrote:

> Hi,
>
> Catch-all field: No need to worry about how to aggregate scores coming
> from different fields.
> But you cannot utilize different analysers for different fields.
>
> Multiple-fields: You can play with edismax's parameters on-the-fly,
> without having to re-index.
> It is flexible that you can include/exclude fields from search.
>
> Ahmet
>
>
>
> On Monday, October 12, 2015 3:39 PM, elisabeth benoit <
> elisaelisael...@gmail.com> wrote:
> Hello,
>
> We're using solr 4.10 and storing all data in a catchall field. It seems to
> me that one good reason for using a catchall field is when using scoring
> with idf (with idf, a word might not have same score in all fields). We got
> rid of idf and are now considering using multiple fields. I remember
> reading somewhere that using a catchall field might speed up searching
> time. I was wondering if some of you have any opinion (or experience)
> related to this subject.
>
> Best regards,
> Elisabeth
>


Re: are there any SolrCloud supervisors?

2015-10-12 Thread Trey Grainger
I'd be very interested in taking a look if you post the code.

Trey Grainger
Co-Author, Solr in Action
Director of Engineering, Search & Recommendations @ CareerBuilder

On Fri, Oct 2, 2015 at 3:09 PM, r b  wrote:

> I've been working on something that just monitors ZooKeeper to add and
> remove nodes from collections. the use case being I put SolrCloud in
> an autoscaling group on EC2 and as instances go up and down, I need
> them added to the collection. It's something I've built for work and
> could clean up to share on GitHub if there is much interest.
>
> I asked in the IRC about a SolrCloud supervisor utility but wanted to
> extend that question to this list. are there any more "full featured"
> supervisors out there?
>
>
> -renning
>


Re: Lucene Revolution ?

2015-10-18 Thread Trey Grainger
Lucene/Solr Revolution just keeps getting better every year, and this year
was clearly the best year yet!

I saw two major themes that I'd say about 2/3 of the talks were
focused on:
  1) Search Relevancy
  2) Analytics

I'd definitely say that there's a rapidly emerging landscape of
presentations covering the cutting edge of search relevancy. Michael
Nilsson and Diego Ceccarelli from Bloomberg gave a presentation on a
Learning to Rank (aka "Machine-Learned Ranking") Solr plugin they are
developing and hoping to open source soon, which I took particular interest
in, as I've got a bit of background there and am working toward developing
something similar over the next few months. In other words, I'm sitting on
the edge of my chair waiting on them to open source it to hopefully save my
team months of similar work : )

Fiona Condon from Etsy also gave a great talk on relevancy from a different
perspective - preventing keyword stuffing/seo gaming/monopoly of their
search results and ensuring uniqueness and fairness in search results in a
system where those contributing the content are all incentivized to game
the system to achieve maximum exposure.

There were also several other relevancy talks I missed, including one from
Simon Hughes from Dice.com on leveraging Latent Semantic Indexing and
Word2Vec to add conceptual search into Solr. This is a topic I remember
being talked about by folks like John Berryman back as early as 2013, but
it looks like Dice released some open source code that can be easily tied
into Solr, which is really exciting to see.  There were many other
presentations on emerging relevancy strategies (sorry if I left your name
off), but I'll have to wait to review the videos with everyone else once
they are posted.

My talk (which Alexandre mentioned earlier) was also on relevancy,
specifically describing building a knowledge graph and intent engine within
Solr that can be used to intelligently parse entities and understand their
relationships dynamically from queries and documents using nothing but the
search index and query logs. (Slides here:
http://www.treygrainger.com/posts/presentations/leveraging-lucene-solr-as-a-knowledge-graph-and-intent-engine/
)

In addition to the many relevancy topics, there was another thread within
the presentations (more committer-driven) around analytics. Specifically,
Tim Potter from LucidWorks (my co-author on Solr in Action) gave a great
presentation on using Spark with Solr, Joel Bernstein and Erick Erickson
gave talks on the recent streaming analytics and parallel computing work
that's being added to Solr, and Yonik Seeley presented on the new JSON
faceting API and the enhanced analytical capabilities therein. Once again,
there were several other talks on faceting and analytics, but there was
quite a strong committer focus on that topic.

Definitely worth checking out the slides and videos when they are posted -
lots of really good material all around.


Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search & Recommendations @ CareerBuilder



On Sun, Oct 18, 2015 at 7:54 PM, Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> Here's a bit from my colleague Eric Pugh summarizing Grant's keynote.
> Admittedly he's also focusing a lot on our firm's relevance
> capabilities/products (the keynote was on relevance) so extensive shameless
> plug warning included with this link :)
>
>
> http://opensourceconnections.com/blog/2015/10/15/bad-behaviors-in-tuning-search-results/
>
> On Sunday, October 18, 2015, Susheel Kumar  wrote:
>
> > I couldn't also make it.  Would love to hear more who make it.
> >
> > Thanks,
> > Susheel
> >
> > On Sun, Oct 18, 2015 at 10:53 AM, Jack Krupansky <
> jack.krupan...@gmail.com
> > >
> > wrote:
> >
> > > Sorry I missed out this year. I thought it was next month and hadn't
> seen
> > > any reminders. Just last Tuesday I finally got around to googling the
> > > conference and was shocked to read that it was the next day. Oh well.
> > > Personally I'm less interested in the formal sessions than the informal
> > > networking.
> > >
> > > In any case, keep those user reports flowing. I'm sure there are plenty
> > of
> > > people who didn't make it to the conference.
> > >
> > > -- Jack Krupansky
> > >
> > > On Sun, Oct 18, 2015 at 8:52 AM, Erik Hatcher  > >
> > > wrote:
> > >
> > > > The Revolution was not televised (though heavily tweeted, and videos
> of
> > > > sessions to follow eventually).  A great time was had by all.  Much
> > > > learning!  Much collaboration. Awesome event if I may say so myself.
> > I'm
> >

Re: Basic Multilingual search capability

2015-02-23 Thread Trey Grainger
Hi Rishi,

I don't generally recommend a language-insensitive approach except for
really simple multilingual use cases (for most of the reasons Walter
mentioned), but the ICUTokenizer is probably the best bet you're going to
have if you really want to go that route and only need exact-match on the
tokens that are parsed. It won't work that well for all languages (CJK
languages, for example), but it will work fine for many.

It is also possible to handle multi-lingual content in a more intelligent
(i.e. per-language configuration) way in your search index, of course.
There are three primary strategies (i.e. ways that actually work in the
real world) to do this:
1) create a separate field for each language and search across all of them
at query time
2) create a separate core per language-combination and search across all of
them at query time
3) invoke multiple language-specific analyzers within a single field's
analyzer and index/query using one or more of those languages' analyzers
for each document/query.

These are listed in ascending order of complexity, and each can be valid
based upon your use case. For at least the first and third cases, you can
use index-time language detection to map to the appropriate
fields/analyzers if you are otherwise unaware of the languages of the
content from your application layer. The third option requires custom code
(included in the large Multilingual Search chapter of Solr in Action
<http://solrinaction.com> and soon to be contributed back to Solr via
SOLR-6492 <https://issues.apache.org/jira/browse/SOLR-6492>), but it
enables you to index an arbitrarily large number of languages into the same
field if needed, while preserving language-specific analysis for each
language.

I presented in detail on the above strategies at Lucene/Solr Revolution
last November, so you may consider checking out the presentation and/or
slides to assess whether one of these strategies will work for your use case:
http://www.treygrainger.com/posts/presentations/semantic-multilingual-strategies-in-lucenesolr/

For the record, I'd highly recommend going with the first strategy (a
separate field per language) if you can, as it is certainly the simplest of
the approaches (albeit the one that scales the least well after you add
more than a few languages to your queries). If you want to stay simple and
stick with the ICUTokenizer then it will work to a point, but some of the
problems Walter mentioned may eventually bite you if you are supporting
certain groups of languages.
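To make that first strategy concrete, here is a minimal sketch (field names
are illustrative; text_en, text_ru, and text_de are the language-specific
field types that ship with Solr's example schema):

<field name="content_en" type="text_en" indexed="true" stored="true"/>
<field name="content_ru" type="text_ru" indexed="true" stored="true"/>
<field name="content_de" type="text_de" indexed="true" stored="true"/>

q=hello&defType=edismax&qf=content_en content_ru content_de

Language detection (or your application layer) decides which of the
per-language fields gets populated for each document.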

All the best,

Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search & Recommendations @ CareerBuilder

On Mon, Feb 23, 2015 at 11:14 PM, Walter Underwood 
wrote:

> It isn’t just complicated, it can be impossible.
>
> Do you have content in Chinese or Japanese? Those languages (and some
> others) do not separate words with spaces. You cannot even do word search
> without a language-specific, dictionary-based parser.
>
> German is space separated, except many noun compounds are not
> space-separated.
>
> Do you have Finnish content? Entire prepositional phrases turn into word
> endings.
>
> Do you have Arabic content? That is even harder.
>
> If all your content is in space-separated languages that are not heavily
> inflected, you can kind of do OK with a language-insensitive approach. But
> it hits the wall pretty fast.
>
> One thing that does work pretty well is trademarked names (LaserJet, Coke,
> etc). Those are spelled the same in all languages and usually not inflected.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> On Feb 23, 2015, at 8:00 PM, Rishi Easwaran 
> wrote:
>
> > Hi Alex,
> >
> > There is no specific language list.
> > For example: the documents that need to be indexed are emails or any
> messages for a global customer base. The messages back and forth could be
> in any language or mix of languages.
> >
> > I understand relevancy, stemming etc becomes extremely complicated with
> multilingual support, but our first goal is to be able to tokenize and
> provide basic search capability for any language. Ex: When the document
> contains hello or здравствуйте, the analyzer creates tokens and provides
> exact match search results.
> >
> > Now it would be great if it had capability to tokenize email addresses
> (ex:he...@aol.com- i think standardTokenizer already does this),
> filenames (здравствуйте.pdf), but maybe we can use filters to accomplish
> that.
> >
> > Thanks,
> > Rishi.
> >
> > -Original Message-
> > From: Alexandre Rafalovitch 
> > To: solr-user 
> > Sent: Mon, Feb 23, 2015 5:49 pm
> > Subject: Re: Basic Multilingual search capability
> >
> >
> > Whic

Re: JSON Facet & Analytics API in Solr 5.1

2015-04-17 Thread Trey Grainger
Agreed, I also prefer the second way. I find it more readable, less verbose
while communicating the same information, less confusing to mentally parse
("is 'terms' the name of my facet, or the type of my facet?..."), and less
prone to syntactically valid, but logically invalid inputs.  Let's break
those topics down.

*1) Less verbose while communicating the same information:*
The flatter structure is particularly useful when you have nested facets to
reduce unnecessary verbosity / extra levels. Let's contrast the two
approaches with just 2 levels of subfacets:

** Current Format **
top_genres: {
  terms: {
    field: genre,
    limit: 5,
    facet: {
      top_authors: {
        terms: {
          field: author,
          limit: 4,
          facet: {
            top_books: {
              terms: {
                field: title,
                limit: 5
              }
            }
          }
        }
      }
    }
  }
}

** Flat Format **
top_genres: {
  type: terms,
  field: genre,
  limit: 5,
  facet: {
    top_authors: {
      type: terms,
      field: author,
      limit: 4,
      facet: {
        top_books: {
          type: terms,
          field: title,
          limit: 5
        }
      }
    }
  }
}

The flat format is clearly shorter and more succinct, while communicating
the same information. What value do the extra levels add?


*2) Less confusing to mentally parse*
I also find the flatter structure less confusing, as I'm consistently
having to take a mental pause with the current format to verify whether
"terms" is the name of my facet or the type of my facet and have to count
the curly braces to figure this out.  Not that I would name my facets like
this, but to give an extreme example of why that extra mental calculation
is necessary due to the name of an attribute in the structure being able to
represent both a facet name and facet type:

terms: {
  terms: {
    field: genre,
    limit: 5,
    facet: {
      terms: {
        terms: {
          field: author,
          limit: 4
        }
      }
    }
  }
}

In this example, the first "terms" is a facet name, the second "terms" is a
facet type, the third is a facet name, etc. Even if you don't name your
facets like this, it still requires parsing someone else's query mentally
to ensure that's not what was done.

3) *Less prone to syntactically valid, but logically invalid inputs*
Also, given this first format (where the type is indicated by one of
several possible attributes: terms, range, etc.), what happens if I pass in
multiple of the valid JSON attributes... the flatter structure prevents
this from being possible (which is a good thing!):

top_authors: {
  terms: {
    field: author,
    limit: 5
  },
  range: {
    field: price,
    start: 0,
    end: 100,
    gap: 20
  }
}

I don't think the response format can currently handle this without adding
in extra levels to make it look like the input side, so this is an
exception case even though it seems syntactically valid.

So in conclusion, I'd give a strong vote to the flatter structure. Can
someone enumerate the benefits of the current format over the flatter
structure (I'm probably dense and just failing to see them currently)?

Thanks,

-Trey


On Fri, Apr 17, 2015 at 2:28 PM, Jean-Sebastien Vachon <
jean-sebastien.vac...@wantedanalytics.com> wrote:

> I prefer the second way. I find it more readable and shorter.
>
> Thanks for making Solr even better ;)
>
> 
> From: Yonik Seeley 
> Sent: Friday, April 17, 2015 12:20 PM
> To: solr-user@lucene.apache.org
> Subject: Re: JSON Facet & Analytics API in Solr 5.1
>
> Does anyone have any thoughts on the current general structure of JSON
> facets?
> The current general form of a facet command is:
>
> <facet_name> : { <facet_type> : <args> }
>
> For example:
>
> top_authors : { terms : {
>   field : author,
>   limit : 5,
> }}
>
> One alternative I considered in the past is having the type in the args:
>
> top_authors : {
>   type : terms,
>   field : author,
>   limit : 5
> }
>
> It's a flatter structure... probably better in some ways, but worse in
> other ways.
> Thoughts / preferences?
>
> -Yonik
>
>
> On Tue, Apr 14, 2015 at 4:30 PM, Yonik Seeley  wrote:
> > Folks, there's a new JSON Facet API in the just released Solr 5.1
> > (actually, a new facet module under the covers too).
> >
> > It's marked as experimental so we have time to change the API based on
> > your feedback.  So let us know what you like, what you would change,
> > what's missing, or any other ideas you may have!
> >
> > I've just started the documentation for the reference guide (on our
> > confluence wiki),

Re: [ANN] Relevant Search by Manning out! (Thanks Solr community!)

2016-06-21 Thread Trey Grainger
Congrats Doug and John! Writing a book like this is a very long, arduous
process (as several folks on this list can attest to). Writing a great book
like this is considerably more challenging.

I read through this entire book a few months ago before they put the final
touches on it, and (for anyone on the mailing list who is contemplating
buying it), it is a REALLY great book that will teach you the ins and outs
of how search relevancy works under the covers and how you can manipulate
and improve it. It's very well-written, and definitely worth the read.

Congrats again, guys.

Trey Grainger
Co-author, Solr in Action
SVP of Engineering @ Lucidworks

On Tue, Jun 21, 2016 at 2:12 PM, Doug Turnbull <
dturnb...@opensourceconnections.com> wrote:

> Not much more to add than my post here! This book is targeted towards
> Lucene-based search (Elasticsearch and Solr) relevance.
>
> Announcement with discount code:
> http://opensourceconnections.com/blog/2016/06/21/relevant-search-published/
>
> Related hacker news thread:
> https://news.ycombinator.com/item?id=11946636
>
> Thanks to everyone in the Solr community that was helpful to my efforts.
> Specifically Trey Grainger, Eric Pugh (for keeping me employed), Charlie
> Hull and the Flax team, Alex Rafalovitch, Timothy Potter, Yonik Seeley,
> Grant Ingersoll (for basically teaching me Solr back in the day), Drew
> Farris (for encouraging my early blogging), everyone at OSC, and many
> others I'm probably forgetting!
>
> Best
> -Doug
>


Re: Hackday next month

2016-09-21 Thread Trey Grainger
I know a bunch of folks who would likely attend the hackday (including
committers) will have some other meetings on Wednesday before the
conference, so I think that Tuesday is actually a pretty good time to have
this.

My 2 cents,

Trey Grainger
SVP of Engineering @ Lucidworks
Co-author, Solr in Action

On Wed, Sep 21, 2016 at 1:20 PM, Anshum Gupta 
wrote:

> This is good but is there a way to instead do this on Wednesday?
> Considering that the conference starts on Thursday, perhaps it makes sense
>  to do it just a day before ? Not sure about others but it certainly would
> work much better for me.
>
> -Anshum
>
> On Wed, Sep 21, 2016 at 2:18 PM Charlie Hull  wrote:
>
> > Hi all,
> >
> > If you're coming to Lucene Revolution next month in Boston, we're
> > running a Lucene-focused hackday (Lucene, Solr, Elasticsearch)
> > kindly hosted by BA Insight. There will be Lucene committers there, it's
> > free to attend and we also need ideas on what to do! Come and join us.
> >
> > http://www.meetup.com/New-England-Search-Technologies-
> NEST-Group/events/233492535/
> >
> > Cheers
> >
> > Charlie
> >
> > --
> > Charlie Hull
> > Flax - Open Source Enterprise Search
> >
> > tel/fax: +44 (0)8700 118334
> > mobile:  +44 (0)7767 825828
> > web: www.flax.co.uk
> >
>


Re: Related Search

2016-10-26 Thread Trey Grainger
Yeah, the approach listed by Grant and Markus is a common approach. I've
worked on systems that mined query logs like this, and it's a good approach
if you have sufficient query logs to pull it off.
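As a rough sketch of that approach (collection, field names, and values below
are purely illustrative), the offline job that analyzes the query logs writes
one document per normalized query into a side collection:

{
  "id": "q-1234",
  "query_s": "solr faceting",
  "related_queries_ss": ["json facet api", "pivot facets", "facet performance"]
}

and the application then looks up related searches with a simple query
against that collection:

q=query_s:"solr faceting"&fl=related_queries_ss&rows=1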

There are a lot of linguistic nuances you'll encounter along the way,
including how you disambiguate homonyms and their related terms, identify
synonyms/acronyms as having the same underlying meaning, how you parse and
handle unknown phrases, removing noise present in the query logs, and even
how you weight the strength of the relationship between related queries. I gave
a presentation on this topic at Lucene/Solr Revolution in 2015 if you're
interested in learning more about how to build such a system (
http://www.treygrainger.com/posts/presentations/leveraging-lucene-solr-as-a-knowledge-graph-and-intent-engine/
).

Another approach (also referenced in the above presentation), for those
with more of a cold-start problem with query logs, is to mine related terms
and phrases out of the underlying content in the search engine (inverted
index) itself. The Semantic Knowledge Graph that was recently open sourced
by CareerBuilder and contributed back to Solr (disclaimer: I worked on it,
and it's available as both a Solr plugin and a patch, but it's not ready to be
committed into Solr yet) enables such a capability. See
https://issues.apache.org/jira/browse/SOLR-9480 for the most current patch.

It is a request handler that can take in any query and discover the most
related other terms to that entire query from the inverted index, sorted by
strength of relationship to that query (it can also traverse from those
terms across fields/relationships to other terms, but that's probably
overkill for the basic related searches use case). Think of it as a way to
run a query and find the most relevant other keywords, as opposed to
finding the most relevant documents.

Using this, you can then either return the related keywords as your related
searches, or you can modify your query to include them and power a
conceptual/semantic search instead of the pure text-based search you
started with. It's effectively a (better) way to implement More Like This,
where instead of taking a document and using tf-idf to extract out the
globally-interesting terms from the document (like MLT), you can instead
use a query to find contextually-relevant keywords across many documents,
score them based upon their similarity to the original query, and then turn
around and use the top most semantically-relevant terms as your related
search(es).
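For example (the terms and boosts here are purely illustrative), if the
original query were "chef" and the most semantically-related terms discovered
were "cook", "culinary", and "restaurant", you could either present those as
related searches or expand the query along the lines of:

q=chef OR cook^0.8 OR culinary^0.6 OR restaurant^0.4

so that the original term still carries the most weight while the related
terms broaden the conceptual match.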

I don't have near-term plans to expose the semantic knowledge graph as a
search component (it's a request handler right now), but once it's finished
that could certainly be done. Just wanted to mention it as another approach
to solve this specific problem.

-Trey Grainger
SVP of Engineering @ Lucidworks
Co-author, Solr in Action



On Wed, Oct 26, 2016 at 1:59 PM, Markus Jelsma 
wrote:

> Indeed, we have similar processes running of which one generates a
> 'related query collection' which just contains a (normalized) query and its
> related queries. I would not know how this is even possible without
> continuously processing query and click logs.
>
> M.
>
>
> -Original message-
> > From:Grant Ingersoll 
> > Sent: Tuesday 25th October 2016 23:51
> > To: solr-user@lucene.apache.org
> > Subject: Re: Related Search
> >
> > Hi Rick,
> >
> > I typically do this stuff just by searching a different collection that I
> > create offline by analyzing query logs and then indexing them and
> searching.
> >
> > On Mon, Oct 24, 2016 at 8:32 PM Rick Leir  wrote:
> >
> > > Hi all,
> > >
> > > There is an issue 'Create a Related Search Component' which has been
> > > open for some years now.
> > >
> > > It has a priority: major.
> > >
> > > https://issues.apache.org/jira/browse/SOLR-2080
> > >
> > >
> > > I discovered it linked from Lucidwork's very useful blog on ecommerce:
> > >
> > >
> > > https://lucidworks.com/blog/2011/01/25/implementing-the-
> ecommerce-checklist-with-apache-solr-and-lucidworks/
> > >
> > >
> > > Did people find a better way to accomplish Related Search? Perhaps MLT
> > > http://wiki.apache.org/solr/MoreLikeThis ?
> > >
> > > cheers -- Rick
> > >
> > >
> > >
> >
>


Re: Semantic Knowledge Graph

2017-10-09 Thread Trey Grainger
Hi David, that's my fault. I need to do a final proofread through them
before they get posted (and may have to push one quick code change, as
well). I'll try to get that done within the next few days.

All the best,

Trey Grainger
SVP of Engineering @ Lucidworks
Co-author, Solr in Action <http://solrinaction.com>
http://www.treygrainger.com


On Mon, Oct 9, 2017 at 10:14 AM, David Hastings <
hastings.recurs...@gmail.com> wrote:

> Hey All, slides from the 2017 Lucene Revolution were put up recently, but
> unfortunately, the one I have the most interest in, the semantic knowledge
> graph, have not been put up:
>
> https://lucenesolrrevolution2017.sched.com/event/BAwX/the-
> apache-solr-semantic-knowledge-graph?iframe=no&w=100%&sidebar=yes&bg=no
>
>
> dont suppose any one knows where i may be able to find them, or point me in
> a direction to get more information about this tool.
>
> Thanks - dave
>


Re: Disabling XmlQParserPlugin through solrconfig

2017-10-12 Thread Trey Grainger
You can also just "replace" the registered xml query parser with another
parser. I imagine you're doing this for security reasons, which means you
just want the actual xml query parser to not be executable through a query.
Try adding the following line to your solrconfig.xml:

<queryParser name="xml" class="solr.ExtendedDismaxQParserPlugin"/>
This way, the xml query parser is loaded in as a version of the eDismax
query parser instead, and any queries that are trying to reference the xml
query parser through local params will instead hit the eDismax query parser
and use its parsing logic.

All the best,

Trey Grainger
SVP of Engineering @ Lucidworks
Co-author, Solr in Action <http://solrinaction.com/>
http://www.treygrainger.com

-

On Thu, Oct 12, 2017 at 6:56 PM, Shawn Heisey  wrote:

> On 10/12/2017 3:18 PM, Manikandan Sivanesan wrote:
>
>> I'm looking for a way to disable the query parser XmlQParserPlugin
>> (org.apache.solr.search.XmlQParserPlugin) through solrconfig.xml .
>> Following the instructions mentioned here
>> <https://wiki.apache.org/solr/SolrConfigXml#Enable.2Fdisable_components>
>> to
>> disable a query parser.
>>
>> This is the part that I added to solrconfig.
>> <queryParser name="xml" class="org.apache.solr.search.XmlQParserPlugin" enable="${enable.xmlparser:false}"/>
>>
>> I have uploaded it to zk and reloaded the collection. But I still see the
>> XmlQParserPlugin loaded in
>> in the Plugin/Stats => QUERYPARSER section of Solr Admin Console.
>>
>
> Through experimentation, I was able to figure out that the configuration
> of query parsers DOES support the "enable" attribute.  Initially I thought
> it might not.
>
> With this invalid configuration (the class is missing a character), Solr
> will start correctly:
>
> 
>
> But if I change the enable attribute to "true" instead of "false", Solr
> will NOT successfully load the core with that config, because it contains a
> class that cannot be found.
>
> The actual problem you're running into is that almost every query parser
> implementation that Solr has is hard-coded and explicitly loaded by code in
> QParserPlugin.  One of those parsers is the XML parser that you want to
> disable.
>
> I think it would be a good idea to go through the list of hard-coded
> parsers in the QParserPlugin class and make it a MUCH smaller list.  Some
> of the parsers, especially the XML parser, probably should require explicit
> configuration rather than being included by default.
>
> Thanks,
> Shawn
>
>


Re: "on deck" searcher vs warming searcher

2016-12-09 Thread Trey Grainger
Shawn and Joel both answered the question with seemingly opposite answers,
but Joel's should be right. On Deck, as an idiom, means "getting ready to
go next". I think it has it's history in military / naval terminology (a
plane being "on deck" of an aircraft carrier was the next one to take off),
and was later used heavily in baseball (the "on deck" batter was the one
warming up to go next) and probably elsewhere.

I've always understood the "on deck" searcher(s) to be the same as the
warming searcher(s). So you have the "active" searcher and then the warming
or on-deck searchers.

-Trey


On Fri, Dec 9, 2016 at 11:54 AM, Erick Erickson 
wrote:

> Jihwan:
>
> Correct. Do note that there are two distinct warnings here:
> 1> "Error opening new searcher. exceeded limit of maxWarmingSearchers"
> 2> "PERFORMANCE WARNING: Overlapping onDeckSearchers=..."
>
> in <1>, the new searcher is _not_ opened.
> in <2>, the new searcher _is_ opened.
>
> In practice, getting either warning is an indication of
> mis-configuration. Consider a very large filterCache with large
> autowarm values. Every new searcher will then allocate space for the
> filterCache so having <1> is there to prevent runaway situations that
> lead to OOM errors.
>
> <2> is just letting you know that you should look at your usage of
> commit so you can avoid <1>.
>
> Best,
> Erick
>
> On Fri, Dec 9, 2016 at 8:44 AM, Jihwan Kim  wrote:
> > why is there a setting (maxWarmingSearchers) that even lets you have more
> > than one:
> > Isn't it also for a case of (frequent) update? For example, one update is
> > committed.  During the warming up  for this commit, another update is
> > made.  In this case the new commit also goes through another warming.  If
> the
> > value is 1, the second warming will fail.  More number of concurrent
> > warming-up requires larger memory usage.
> >
> >
> > On Fri, Dec 9, 2016 at 9:14 AM, Erick Erickson 
> > wrote:
> >
> >> bq: because shouldn't there only be one active
> >> searcher at a time?
> >>
> >> Kind of. This is a total nit, but there can be multiple
> >> searchers serving queries briefly (one hopes at least).
> >> S1 is serving some query when S2 becomes
> >> active and starts getting new queries. Until the last
> >> query S1 is serving is complete, they both are active.
> >>
> >> bq: why is there a setting
> >> (maxWarmingSearchers) that even lets
> >> you have more than one
> >>
> >> The contract is that when you commit (assuming
> >> you're opening a new searcher), then all docs
> >> indexed up to that point are visible. Therefore you
> >> _must_ open a new searcher even if one is currently
> >> warming or that contract would be violated. Since
> >> warming can take minutes, not opening a new
> >> searcher if one was currently warming could cause
> >> quite a gap.
> >>
> >>
> >> Best,
> >> Erick
> >>
> >> On Fri, Dec 9, 2016 at 7:30 AM, Brent  wrote:
> >> > Hmmm, conflicting answers. Given the infamous "PERFORMANCE WARNING:
> >> > Overlapping onDeckSearchers" log message, it seems like the "they're
> the
> >> > same" answer is probably correct, because shouldn't there only be one
> >> active
> >> > searcher at a time?
> >> >
> >> > Although it makes me curious, if there's a warning about having
> multiple
> >> > (overlapping) warming searchers, why is there a setting
> >> > (maxWarmingSearchers) that even lets you have more than one, or at
> least,
> >> > why ever set it to anything other than 1?
> >> >
> >> >
> >> >
> >> > --
> >> > View this message in context: http://lucene.472066.n3.
> >> nabble.com/on-deck-searcher-vs-warming-searcher-tp4309021p4309080.html
> >> > Sent from the Solr - User mailing list archive at Nabble.com.
> >>
>


Re: analyzer with multiple stem-filters for more languages

2014-03-14 Thread Trey Grainger
I wouldn't recommend putting multiple stemmers in the same Analyzer. Like
Jack said, the second stemmer could take the results of the first stemmer
and "stem the stem", wreaking all kinds of havoc on the resulting terms.

Since the Stemmers replace the original word, running two of them in
sequence will mean the second stemmer never sees the original input in
cases where the first stemmer modified it. Also, many languages require
multiple different CharFilters and TokenFilters (some for accent
normalization, some for stopwords and/or synonyms, some for stemming,
etc.), so it will get VERY complicated trying to safely coordinate when
each token filter runs... probably impossible for many language
combinations.

What you CAN do, however, is define multiple language-specific Analyzers
and then invoke both Analyzers separately within your field, stacking the
resulting tokens from each Analyzer's outputted token stream according to
their position increments. Think of it as having sub-fields within a field,
where each sub-field has its own dedicated Analyzer.

Shameless plug: We cover how to do this (and provide the sample code) in
the "Multilingual Search" chapter of *Solr in Action
<http://solrinaction.com>*, the new book from Manning Publications that is
about to be released within the next few days. The source code is all
publicly available, though, if you want to get an idea of how this works:
https://github.com/treygrainger/solr-in-action/tree/master/src/main/java/sia/ch14

Of course, if you want to take a simpler route, you can always just copy
your text to two separate fields (one per language) and then search across
them at query time using the eDisMax query parser. There are pros and cons
to both approaches.

All the best,

-Trey Grainger




On Fri, Mar 14, 2014 at 8:00 PM, Jack Krupansky wrote:

> You would have to carefully analyze the source code and tables of these
> two stemmers to determine if one might incorrectly stem words in the other
> language. Technically, that could be fine for indexing, but it might give
> users some unexpected results for queries. There might also be cases where
> the second stemmer would stem a term that was already stemmed by the first
> stemmer.
>
> You could avoid the latter issue by using the duplicate token technique.
> For a single stemmer this is generally:
>
> 
> 
> 
> 
> 
>
> For two (or more) languages:
>
> 
> 
> 
> 
>
> 
> 
>
> This would produce the stemmed term for both languages, or either
> language, or neither, as the case may be.
>
> -- Jack Krupansky
>
> -Original Message- From: Croci Francesco Luigi (ID SWS)
> Sent: Friday, March 14, 2014 8:17 AM
> To: solr-user@lucene.apache.org
> Subject: analyzer with multiple stem-filters for more languages
>
>
> It is possible to define an analyzer with more than one Stem-filter for
> more languages?
>
> Something like this:
>
> 
>...
>   (default for english)
> 
> 
>
> Greetings
> Francesco
>


[ANN] Solr in Action book release (Solr 4.7)

2014-03-27 Thread Trey Grainger
I'm excited to announce the final print release of *Solr in Action*, the
newest Solr book by Manning Publications, covering through Solr 4.7 (the
current version). The book is available for immediate purchase in print and
ebook formats, and the *outline*, some *free chapters*, and the *full
source code* are also available at http://solrinaction.com.

I would love it if you would check the book out, and I would also
appreciate your feedback on it, especially if you find the book to be a
useful guide as you are working with Solr! Timothy Potter and I (Trey
Grainger) worked tirelessly on the book for nearly 2 years to bring you a
thorough (664 pg.) and fantastic example-driven guide to the best Solr has
to offer.

*Solr in Action* is intentionally designed to be a learning guide as
opposed to a reference manual. It builds from an initial introduction to
Solr all the way to advanced topics such as implementing a predictive
search experience, writing your own Solr plugins for function queries and
multilingual text analysis, using Solr for big data analytics, and even
building your own Solr-based recommendation engine. The book uses fun
real-world examples, including analyzing the text of tweets, searching and
faceting on restaurants, grouping similar items in an ecommerce
application, highlighting interesting keywords in UFO sighting reports, and
even building a personalized job search experience.

For a more detailed write-up about the book and its contents, you can also
visit the Solr homepage at
https://lucene.apache.org/solr/books.html#solr-in-action. Thanks in advance
for checking it out, and I really hope many of you find the book to be
personally useful!

All the best,

Trey Grainger
Co-author, *Solr in Action*
Director of Engineering, Search & Analytics @ CareerBuilder


Re: [ANN] Solr in Action book release (Solr 4.7)

2014-03-27 Thread Trey Grainger
Hi Philippe,

Yes if you've purchased the eBook then the PDF is available now and the
other formats (ePub and Kindle) are supposed to be available for download
on April 8th.
It's also worth mentioning that the eBook formats are all available for
free with the purchase of the print book.

Best regards,

Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search & Analytics @CareerBuilder


On Thu, Mar 27, 2014 at 12:04 PM, Philippe Soares 
wrote:
>
> Thanks Trey !
> I just tried to download my copy from my manning account, and this final
> version appears only in PDF format.
> Any idea about when they'll release the other formats ?


Re: Multiple Languages in Same Core

2014-03-27 Thread Trey Grainger
In addition to the two approaches Liu Bo mentioned (separate core per
language and separate field per language), it is also possible to put
multiple languages in a single field. This saves you the overhead of
multiple cores and of having to search across multiple fields at query
time. The idea here is that you can run multiple analyzers (i.e. one for
German, one for English, one for Chinese, etc.) and stack the outputted
TokenStreams for each of these within a single field. It is also possible
to swap out the languages you want to use on a case-by-case basis (i.e.
per-document, per field, or even per word) if you really need to for
advanced use cases.

All three of these methods, including code examples and the pros and cons
of each are discussed in the Multilingual Search chapter of Solr in Action,
which Alexandre referenced. If you don't have the book, you can also just
download and run the code examples for free, though they may be harder to
follow without the context from the book.

Thanks,

Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search & Analytics @CareerBuilder





On Wed, Mar 26, 2014 at 4:34 AM, Liu Bo  wrote:

> Hi Jeremy
>
> There're a lot of multi language discussions, two main approaches
>  1. like yours, a language is one core
>  2. all in one core, different language has it's own field.
>
> We have multi-language support in a single core; each multilingual field
> has its own suffix such as name_en_US. We customized the query handler to hide
> the query details to client.
> The main reason we want to do this is about NRT index and search,
> take product for example:
>
> product has price and quantity, which are common fields used for filtering
> and sorting, while name and description are multilingual fields.
> If we split the product into different cores, an update to a common field
> may end up requiring an update in all of the multilingual cores.
>
> As to scalability, we don't change solr cores/collections when a new
> language is added, but we probably need update our customized index process
> and run a full re-index.
>
> This approach suits our requirement for now, but you may have your own
> concerns.
>
> We have similar "suggest filter" problem like yours, we want to return
> suggest result filtering by stores. I can't find a way to build dictionary
> with query at my version of solr 4.6
>
> What I do is run a query on a N-Gram analyzed field and with filter queries
> on store_id field. The "suggest" is actually a query. It may not perform as
> well as suggestion but can do the trick.
>
> You can try it to build a additional N-GRAM field for suggestion only and
> search on it with fq on your "Locale" field.
>
> All the best
>
> Liu Bo
>
>
>
>
> On 25 March 2014 09:15, Alexandre Rafalovitch  wrote:
>
> > Solr In Action has a significant discussion on the multi-lingual
> > approach. They also have some code samples out there. Might be worth a
> > look
> >
> > Regards,
> >Alex.
> > Personal website: http://www.outerthoughts.com/
> > LinkedIn: http://www.linkedin.com/in/alexandrerafalovitch
> > - Time is the quality of nature that keeps events from happening all
> > at once. Lately, it doesn't seem to be working.  (Anonymous  - via GTD
> > book)
> >
> >
> > On Tue, Mar 25, 2014 at 4:43 AM, Jeremy Thomerson
> >  wrote:
> > > I recently deployed Solr to back the site search feature of a site I
> work
> > > on. The site itself is available in hundreds of languages. With the
> > initial
> > > release of site search we have enabled the feature for ten of those
> > > languages. This is distributed across eight cores, with two Chinese
> > > languages plus Korean combined into one CJK core and each of the other
> > > seven languages in their own individual cores. The reason for splitting
> > > these into separate cores was so that we could have the same field
> names
> > > across all cores but have different configuration for analyzers, etc,
> per
> > > core.
> > >
> > > Now I have some questions on this approach.
> > >
> > > 1) Scalability: Considering I need to scale this to many dozens more
> > > languages, perhaps hundreds more, is there a better way so that I don't
> > end
> > > up needing dozens or hundreds of cores? My initial plan was that many
> > > languages that didn't have special support within Solr would simply get
> > > lumped into a single "default" core that has some default analyzers
> that
> > > are applicable to the majority of languages.
> > >
> > &g

Re: multiple analyzers for one field

2014-04-10 Thread Trey Grainger
Hi Michael,

It IS possible to utilize multiple Analyzers within a single field, but
it's not a "built in" capability of Solr right now. I wrote something I
called a "MultiTextField" which provides this capability, and you can see
the code here:
https://github.com/treygrainger/solr-in-action/tree/master/src/main/java/sia/ch14

The general idea is that you can pass in a prefix for each piece of your
content and then use that prefix to dynamically select one or more
Analyzers for each piece of content. So, for example, you could pass in
something like this when indexing your document (for a multiValued field):
en|some text
es|some more text
de,fr|some other text

Then, the MultiTextField will parse the prefixes and dynamically grab an
Analyzer based upon the prefix. In this case, the first input will be
processed using an English analyzer, the second input will use a Spanish
analyzer, and the third input will use both a German and a French analyzer,
as configured in the field type's definition in the schema.xml:






If you want to automagically map separate fields into one of these dynamic
analyzer (MultiText) fields with prefixes, you could either pass the text
in multiple times from the client to the same field (with different
Analyzer prefixes each time like shown above), OR you could write an Update
Request Processor that does this for you. I don't think it is possible to
just have the copyField add in prefixes automatically for you, though
someone please correct me if I'm wrong.

If you implement an Update Request Processor, then inside it you would
simply grab the text from each of the relevant fields (i.e. author and
title fields) and then add that field's value to the named MultiText field
with the appropriate Analyzer prefix based upon each field. I made an
example Update Request Processor (see the previous github link and look for
MultiTextFieldLanguageIdentifierUpdateProcessor) that you could look at as
an example of how to supply different analyzer prefixes to different values
within a multiValued field, though you would obviously want to throw away
all the language detection stuff since it doesn't match your specific use
case.

All that being said, this solution may end up being overly complicated for
your use case, so your idea of creating a custom analyzer to just handle
your example might be much less complicated. At any rate, that's the
specific answer to your specific question about whether it is possible to
utilize multiple Analyzers within a field based upon multiple inputs.

All the best,

Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search & Analytics @ CareerBuilder


On Thu, Apr 10, 2014 at 9:05 PM, Michael Sokolov <
msoko...@safaribooksonline.com> wrote:

> The lack of response to this question makes me think that either there is
> no good answer, or maybe the question was too obtuse.  So I'll give it one
> more go with some more detail ...
>
> My main goal is to implement autocompletion with a mix of words and short
> phrases, where the words are drawn from the text of largish documents, and
> the phrases are author names and document titles.
>
> I think the best way to accomplish this is to concoct a single field that
> contains data from these other "source" fields (as usual with copyField),
> but with some of the fields treated as keywords (ie with their values
> inserted as single tokens), and others tokenized.  I believe this would be
> possible at the Lucene level by calling Document.addField () with multiple
> fields having the same name: some marked as TOKENIZED and others not.  I
> think the tokenized fields would have to share the same analyzer, but
> that's OK for my case.
>
> I can't see how this could be made to happen in Solr without a lot of
> custom coding though. It seems as if the conversion from Solr fields to
> Lucene fields is not an easy thing to influence.  If anyone has an idea how
> to achieve the subgoal, or perhaps a different way of getting at the main
> goal, I'd love to hear about it.
>
> So far my only other idea is to write some kind of custom analyzer that
> treats short texts as keywords and tokenizes longer ones, which is probably
> what I'll look at if nothing else comes up.
>
> Thanks
>
> Mike
>
>
>
> On 4/9/2014 4:16 PM, Michael Sokolov wrote:
>
>> I think I would like to do something like copyfield from a bunch of
>> fields into a single field, but with different analysis for each source,
>> and I'm pretty sure that's not a thing. Is there some alternate way to
>> accomplish my goal?
>>
>> Which is to have a suggester that suggests words from my full text field
>> and complete phrases drawn from my author and title fields all at the same
>> time.  So If I could index author and title using KeyWordAnalyzer, and full
>> text tokenized, that would be the bees knees.
>>
>> -Mike
>>
>
>


Re: facet.field counts when q includes field

2014-04-27 Thread Trey Grainger
>>So my question basically is: which restrictions are applied to the docset
from which (field) facets are computed?

Facets are generated based upon values found within the documents matching
your "q=" parameter and also all of your "fq=" parameters. Basically, if
you do an intersection of the docsets from all "q=" and "fq=" parameters
then you end up with the docset the facet calculations are based upon.

When you say "if I add type=book, *no* documents match, but I get facet
counts: { chapter=4 }", I'm not exactly sure what you mean. If you are
adding "q=toto&type=book&facet=true&facet.field=type" then the problem is
that the "type=book" parameter doesn't do anything... it is not a valid
Solr parameter for filtering here. In this case, all 4 of your documents
matching the "q=toto" query are still being returned, which is why the
facet count for chapters is 4.

If instead you specify "q=toto&fq=type:book&facet=true&facet.field=type"
then this will filter down to ONLY the documents with a type of book. Since
it looks like in your data there are no documents which are both a type of
book and also match the "q=toto" query, you should get 0 documents and thus
the counts of all your facet values will be zero.

As you mentioned, it is possible to utilize tags and excludes to change the
behavior described above, but hopefully this answers your question about
the default behavior.
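For reference, a tagged/excluded version of the same request would look
something like this, which filters the result set down to books but still
computes the type facet as though that filter were not applied:

q=toto&fq={!tag=typeTag}type:book&facet=true&facet.field={!ex=typeTag}type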

Thanks,

Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search & Analytics @ CareerBuilder


On Sun, Apr 27, 2014 at 4:51 PM, Michael Sokolov <
msoko...@safaribooksonline.com> wrote:

> I'm trying to understand the facet counts I'm getting back from Solr when
> the main query includes a term that restricts on a field that is being
> faceted.  After reading the docs on the wiki (both wikis) I'm confused.
>
> In my little test dataset, if I facet on "type" and use q=*:*, I get facet
> counts for type: [ chapter=5, book=1 ]
>
> With q=toto, only four of the chapters match, so I get facet counts for
> type: { chapter=4 } .
>
> Now if I add type=book, *no* documents match, but I get facet counts: {
> chapter=4 }.
>
> It's as if the type term from the query is being ignored when the facets
> are computed.  This is actually what we want, in general, but the
> documentation doesn't reflect it and I'd like to understand better the
> mechanism so I can tell what I can rely on.
>
> I see that there is the possibility of tagging and excluding filters (fq)
> so they don't effect the facet counting, but there's no mention on the wiki
> of any sort of term exclusion from the main query.  I poked around in the
> source a bit, but wasn't able to find an answer quickly, so I thought I'd
> ask here.
>
> So my question basically is: which restrictions are applied to the docset
> from which (field) facets are computed?
>
> -Mike
>
>
>


Re: facet.field counts when q includes field

2014-04-27 Thread Trey Grainger
No problem, Mike. Glad you got it sorted out.

Trey Grainger
Co-author, Solr in Action
Director of Engineering, Search & Analytics @ CareerBuilder


On Sun, Apr 27, 2014 at 7:23 PM, Michael Sokolov <
msoko...@safaribooksonline.com> wrote:

> On 4/27/14 7:02 PM, Michael Sokolov wrote:
>
>> On 4/27/2014 6:30 PM, Trey Grainger wrote:
>>
>>> So my question basically is: which restrictions are applied to the docset
>>>>>
>>>> from which (field) facets are computed?
>>>
>>> Facets are generated based upon values found within the documents
>>> matching
>>> your "q=" parameter and also all of your "fq=" parameters. Basically, if
>>> you do an intersection of the docsets from all "q=" and "fq=" parameters
>>> then you end up with the docset the facet calculations are based upon.
>>>
>>> When you say "if I add type=book, *no* documents match, but I get facet
>>> counts: { chapter=4 }", I'm not exactly sure what you mean. If you are
>>> adding "q=toto&type=book&facet=true&facet.field=type" then the problem
>>> is
>>> that the "type=book" parameter doesn't do anything... it is not a valid
>>> Solr parameter for filtering here. In this case, all 4 of your documents
>>> matching the "q=toto" query are still being returned, which is why the
>>> facet count for chapters is 4.
>>>
>> In fact my query looks like:
>>
>> q=fulltext_t%3A%28toto%29+AND+dc_type_s%3A%28book%29+%
>> 2Bdirectory_b%3Afalse&start=0&rows=20&fl=uri%2Ctimestamp%
>> 2Cdirectory_b%2Csize_i%2Cmeta_ss%2Cmime_type_ss&facet.field=dc_type_s
>>
>> or without url encoding:
>>
>>  q=fulltext_t:(toto) AND dc_type_s:(book) (directory_b:false)
>> facet.field=dc_type_s
>>
>> default operator is AND
>>
>>  ... so I don't think that the query is broken like you described?
>>
>> -Mike
>>
> OK the problem wasn't with the query, but while I tried to write out a
> clearer explanation, I found it -- an issue in a unit test too boring to
> describe.  Facets do seem to work like you said, and how they're
> documented, and as I assumed they did :)
>
> Thanks, and sorry for the noise.
>
> -Mike
>


Re: Single multilingual field analyzed based on other field values

2013-10-28 Thread Trey Grainger
Hi David,

What version of the Solr in Action MEAP are you looking at (current version
is 12, and version 13 is coming out later this week, and prior versions had
significant bugs in the code you are referencing)?  I added an update
processor in the most recent version that can do language identification
and prepend the language codes for you (even removing them from the stored
version of the field and only including them on the indexed version for
text analysis).

You could easily modify this update processor to read the value from the
language field and use it as the basis of the pre-pended languages.

Otherwise, if you want to do language detection instead of passing in the
language manually, MultiTextField in chapter 14 of Solr in Action and the
corresponding MultiTextFieldLanguageIdentifierUpdateProcessor should handle
all of the language detection and pre-pending automatically for you (and
also append the identified language to a separate field).

If it were easy/possible to have access to the rest of the fields in the
document from within a field's Analyzer then I would have certainly opted
for that approach instead of the whole pre-pending languages to content
option.  If it is too cumbersome, you could probably rewrite the
MultiTextField to pull the language from the field name instead of the
content (i.e. the field name would carry the languages and the content
would just be "blah, blah", instead of "en,fr|blah, blah" as currently
designed).
 This would make specifying the language much easier (especially at query
time since you only have to specify the languages once instead of on each
term), and you could have Solr still search the same underlying field for
all languages.  Same general idea, though.

In terms of your ThreadLocal cache idea... that sounds really scary to me.
 The Analyzers' TokenStreamComponents are cached in a ThreadLocal context
depending upon the internal ReusePolicy, and I'm skeptical that you'll
be able to pull this off cleanly.  It would really be hacking around the
Lucene APIs even if you were able to pull it off.

-Trey


On Mon, Oct 28, 2013 at 5:15 PM, Jack Krupansky wrote:

> Consider an update processor - it can operate on any field and has access
> to all fields.
>
> You could have one update processor to combine all the fields to process,
> into a temporary, dummy field. Then run a language detection update
> processor on the combined field. Then process the results and place in the
> desired field. And finally remove any temporary fields.
>
> -- Jack Krupansky
> -Original Message- From: David Anthony Troiano
> Sent: Monday, October 28, 2013 4:47 PM
> To: solr-user@lucene.apache.org
> Subject: Single multilingual field analyzed based on other field values
>
>
> Hello,
>
> First some background...
>
> I am indexing a multilingual document set where documents themselves can
> contain multiple languages.  The language(s) within my documents are known
> ahead of time.  I have tried separate fields per language, and due to the
> poor query performance I'm seeing with that approach (many languages /
> fields), I'm trying to create a single multilingual field.
>
> One approach to this problem is given in Section 14.6.4 of
> the new Solr In Action book.  The approach is to take the document
> content field and prepend it with the list contained languages followed by
> a special delimiter.  A new field type is defined that maps languages to
> sub field types, and the new type's tokenizer then runs all of the sub
> field type analyzers over the field and merges results, adjusts offsets for
> the prepended data, etc.
>
> Due to the tokenizer complexity incurred, I'd like to pursue a more
> flexible approach, which is to run the various language-specific analyzers
> not based on prepended codes, but instead based on other field values
> (i.e., a language field).
>
> I don't see a straightforward way to do this, mostly because a field
> analyzer doesn't have access to the rest of the document.  On the flip
> side, an UpdateRequestProcessor would have access to the document but
> doesn't really give a path to wind up where I want to be (single field with
> different analyzers run dynamically).
>
> Finally, my question: is it possible to thread cache document language(s)
> during UpdateRequestProcessor execution (where we have access to the full
> document), so that the analyzer can then read from the cache to determine
> which analyzer(s) to run?  More specifically, if a document is run through
> its URP chain on thread T, will its analyzer(s) also run on thread T and
> will no other documents be run through the URP on that thread in the
> interim?
>
> Thanks,
> Dave
>


Re: LanguageIdentifierUpdateProcessor uses only firstValue() on multivalued fields

2013-11-28 Thread Trey Grainger
Yeah, the documentation is definitely wrong - it doesn't concatenate the
values in a multivalued field; it only uses the first one like you mentioned.

If you want to detect the language of each of the values in the
multi-valued field (as opposed to specifying multiple separate string
values), however, this is easy enough to accomplish by modifying the code
in the language detect update processor to loop through each of the values:

LinkedHashSet<String> langsToPrepend = new LinkedHashSet<String>();
for (final Object inputValue : inputField.getValues()) {
  List<DetectedLanguage> fieldValueLangs = null;
  if (inputValue instanceof String) {
    fieldValueLangs = this.detectLanguage(inputValue.toString());
  }
  if (fieldValueLangs != null) {  // non-String values are skipped
    for (DetectedLanguage lang : fieldValueLangs) {
      langsToPrepend.add(lang.getLangCode());
    }
  }
}

The "langsToPrepend" variable above will contain a set of languages,
where detectLanguage was called separately for each value in the
multivalued field.  If you just want to concatenate all the values and
detect languages once (as opposed to only using the first value in the
multivalued field, like it does today), just concatenate each of the
input values in the first loop and call detectLanguage once at the
end.
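
Roughly something like this (an untested sketch reusing the same inputField
and detectLanguage references from the loop above):

StringBuilder combined = new StringBuilder();
for (final Object inputValue : inputField.getValues()) {
  if (inputValue instanceof String) {
    combined.append(inputValue.toString()).append(" ");
  }
}
// one detection pass over all of the values instead of one per value
List<DetectedLanguage> allValueLangs = this.detectLanguage(combined.toString());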

I wrote code that does this for an example in the Solr in Action book.
 The particular example was detecting languages for each value in a
multivalued field and then pre-pending the language to the text for
the multivalued field (so the analyzer would know which stemmer to
use, as they were being dynamically substituted in based upon the
language).  The code is available here if you are interested:
https://github.com/treygrainger/solr-in-action/blob/master/src/main/java/sia/ch14/MultiTextFieldLanguageIdentifierUpdateProcessor.java

Good luck!

-Trey




On Wed, Nov 27, 2013 at 10:16 AM, Müller, Stephan <
muel...@ponton-consulting.de> wrote:

> > I suspect that it is an oversight for a use case that was not considered.
> > I mean, it should probably either ignore or convert non text/string
> > values.
> Ok, I'll see that I provide a patch against trunk. It actually
> ignores non string values, but is unable to check the remaining values
> of a multivalued field.
>
> > Hmmm... are you using JSON input? I mean, how are the types being set?
> > Solr XML doesn't have a way to set the value types.
> >
> No. It's a field with multivalued=true. That results in a SolrInputField
> where value (which is defined to be Object) actually holds a List.
> This list is populated with Integer, String, Date, you name it.
> I'm talking about the actual Java-Datatypes. The values in the list are
> probably set by this 3rdparty Textbodyprocessor thingy.
>
> Now the Language processor just asks for field.getValue().
> This is delegated to the SolrInputField which in turn calls firstValue().
> Interestingly enough, firstValue() already is able to handle a Collection as its value.
> But if the value is a collection, it just returns the first element.
>
> > You could workaround it with an update processor that copied the field
> and
> > massaged the multiple values into what you really want the language
> > detection to see. You could even implement that processor as a JavaScript
> > script with the stateless script update processor.
> >
> Our workaround would be to not feed the multivalued field but only the
> String fields (which are also included in the multivalued field)
>
>
> Filing a Bug/Feature request and providing the patch will take some time
> as I haven't setup a fully working trunk in my IDEA installation.
> But I'm eager to do it :)
>
> Regards,
> Stephan
>
>
> > -- Jack Krupansky
> >
> > -Original Message-
> > From: Müller, Stephan
> > Sent: Wednesday, November 27, 2013 5:02 AM
> > To: solr-user@lucene.apache.org
> > Subject: LanguageIdentifierUpdateProcessor uses only firstValue() on
> > multivalued fields
> >
> > Hello,
> >
> > this is a repost. This message was originally posted on the 'general'
> list
> > but it was suggested, that the 'user' list might be a better place to
> ask.
> >
> >  Original Message 
> > Hi,
> >
> > we are passing a multivalued field to the
> > LanguageIdentifierUpdateProcessor.
> > This multivalued field contains arbitrary types (Integer, String, Date).
> >
> > Now, the LanguageIdentifierUpdateProcessor.concatFields(SolrInputDocument
> > doc, String[] fields), which btw does not use the parameter fields, is
> > unable to parse all fields of the/a multivalued field. The call "Object
> > content = doc.getFieldValue(fieldName);" does not care what type the
> field
> > is and just delegates to SolrInputDocument which in turn calls
> > getFirstValue.
> >
> > So, two issues:
> > First - if the first value of the multivalued field is not of type
> String,
> > the field is ignored completely.
> >
> > Second - the concat method does not concat all values of a multivalued
> > field.
> >
> > While http://www.mail-archive.com/solr-

Re: Function query matching

2013-12-02 Thread Trey Grainger
We're working on the same problem with the scale(query(...))
combination, so I'd like to share a bit more information
that may be useful.

*On the scale function:*
Even though the scale query has to calculate the scores for all documents,
it is actually doing this work twice for each ValueSource (once to
calculate the min and max values, and then again when actually scoring the
documents), which is inefficient.

To solve the problem, we're in the process of putting a cache inside the
scale function to remember the values for each document when they are
initially computed (to find the min and max) so that the second pass can
just use the previously computed values for each document.  Our theory is
that most of the extra time due to the scale function is really just the
result of doing duplicate work.

No promises this won't be overly costly in terms of memory utilization, but
we'll see what we get in terms of speed improvements and will share the
code if it works out well.  Alternate implementation suggestions (or
criticism of a cache like this) are also welcomed.


*On the NoOp product function: scale(prod(1, query(...))):*
We do the same thing, which ultimately is just an unnecessary waste of a
loop through all documents to do an extra multiplication step.  I just
debugged the code and uncovered the problem.  There is a Map (called
context) that is passed through to each value source to store intermediate
state, and both the query and scale functions are passing the ValueSource
for the query function in as the KEY to this Map (as opposed to using some
composite key that makes sense in the current context).  Essentially, these
lines are overwriting each other:

Inside ScaleFloatFunction: context.put(this.source, scaleInfo);
 //this.source refers to the QueryValueSource, and the scaleInfo refers to
a ScaleInfo object
Inside QueryValueSource: context.put(this, w); //this refers to the same
QueryValueSource from above, and the w refers to a Weight object

As such, when the ScaleFloatFunction later goes to read the ScaleInfo from
the context Map, it unexpectedly pulls the Weight object out instead and
thus the invalid case exception occurs.  The NoOp multiplication works
because it puts an "different" ValueSource between the query and the
ScaleFloatFunction such that this.source (in ScaleFloatFunction) != this
(in QueryValueSource).

This should be an easy fix.  I'll create a JIRA ticket to use better key
names in these functions and push up a patch.  This will eliminate the need
for the extra NoOp function.
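
The shape of the fix is just to key the two entries differently so they no
longer collide - something along these lines (a sketch of the idea, not the
actual patch; the corresponding context.get() calls would of course need to
use the same composite keys):

// inside QueryValueSource.createWeight (sketch - java.util.AbstractMap.SimpleEntry
// is used here only as a convenient composite key):
context.put(new AbstractMap.SimpleEntry<Object, String>(this, "weight"), w);

// inside ScaleFloatFunction (sketch):
context.put(new AbstractMap.SimpleEntry<Object, String>(this.source, "scaleInfo"), scaleInfo);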

-Trey


On Mon, Dec 2, 2013 at 12:41 PM, Peter Keegan wrote:

> I'm persuing this possible PostFilter solution, I can see how to collect
> all the hits and recompute the scores in a PostFilter, after all the hits
> have been collected (for scaling). Now, I can't see how to get the custom
> doc/score values back into the main query's HitQueue. Any advice?
>
> Thanks,
> Peter
>
>
> On Fri, Nov 29, 2013 at 9:18 AM, Peter Keegan  >wrote:
>
> > Instead of using a function query, could I use the edismax query (plus
> > some low cost filters not shown in the example) and implement the
> > scale/sum/product computation in a PostFilter? Is the query's maxScore
> > available there?
> >
> > Thanks,
> > Peter
> >
> >
> > On Wed, Nov 27, 2013 at 1:58 PM, Peter Keegan  >wrote:
> >
> >> Although the 'scale' is a big part of it, here's a closer breakdown.
> Here
> >> are 4 queries with increasing functions, and their response times
> (caching
> >> turned off in solrconfig):
> >>
> >> 100 msec:
> >> select?q={!edismax v='news' qf='title^2 body'}
> >>
> >> 135 msec:
> >> select?qq={!edismax v='news' qf='title^2
> >> body'}&q={!func}product(field(myfield),query($qq))&fq={!query v=$qq}
> >>
> >> 200 msec:
> >> select?qq={!edismax v='news' qf='title^2
> >>
> body'}&q={!func}sum(product(0.75,query($qq)),product(0.25,field(myfield)))&fq={!query
> >> v=$qq}
> >>
> >> 320 msec:
> >>  select?qq={!edismax v='news' qf='title^2
> >>
> body'}&scaledQ=scale(product(query($qq),1),0,1)&q={!func}sum(product(0.75,$scaledQ),product(0.25,field(myfield)))&fq={!query
> >> v=$qq}
> >>
> >> Btw, that no-op product is necessary, else you get this exception:
> >>
> >> org.apache.lucene.search.BooleanQuery$BooleanWeight cannot be cast to
> org.apache.lucene.queries.function.valuesource.ScaleFloatFunction$ScaleInfo
> >>
> >> thanks,
> >>
> >> peter
> >>
> >>
> >>
> >> On Wed, Nov 27, 2013 at 1:30 PM, Chris Hostetter <
> >> hossman_luc...@fucit.org> wrote:
> >>
> >>>
> >>> : So, this query does just what I want, but it's typically 3 times
> slower
> >>> : than the edismax query  without the functions:
> >>>
> >>> that's because the scale() function is inherently slow (it has to
> >>> compute the min & max value for every document in order to know how to
> >>> scale them)
> >>>
> >>> what you are seeing is the price you have to pay to get that query
> with a
> >>> "normalized" 0-1 value.
> >>>
> >>> (you might be able to save a littl

Re: Re: LanguageIdentifierUpdateProcessor uses only firstValue() on multivalued fields

2013-12-12 Thread Trey Grainger
Hmm... haven't run into the case where null was returned in a multi-valued
scenario yet... I probably just haven't tested that case.  I likely need to
add a null check there - thanks for pointing it out.

-Trey


On Fri, Nov 29, 2013 at 6:10 AM, Müller, Stephan <
muel...@ponton-consulting.de> wrote:

> Hello Trey, thank you for this example.
>
> We've solved it by omitting the multivalued field and passing the distinct
> string fields instead. Still, I will go ahead with proposing a patch, so the language
> processor is able to concatenate multivalues by default. I think it's a
> reasonable feature (and I can't remember ever having contributed a patch to
> an open source project)
> My thoughts on the patch implementation are quite the same as Yours,
> iterating on getValues(). I'll have this discussed in the dev-list and
> probably in JIRA.
>
>
> One thing: How do you guard against a possible NPE in line 129
> > (final Object inputValue : inputField.getValues()) {
>
> SolrInputField.getValues() will return NULL if the associated value was
> null. It does not create an empty Collection.
> That, btw, seems to be a minor bug in the javadoc, not stating that this
> method returns null.
>
>
> Regards,
> Stephan - srm
>
> [...]
>
> > The "langsToPrepend" variable above will contain a set of languages,
> where
> > detectLanguage was called separately for each value in the multivalued
> > field.  If you just want to concatenate all the values and detect
> > languages once (as opposed to only using the first value in the
> > multivalued field, like it does today), just concatenate each of the
> input
> > values in the first loop and call detectLanguage once at the end.
> >
> > I wrote code that does this for an example in the Solr in Action book.
> >  The particular example was detecting languages for each value in a
> > multivalued field and then pre-pending the language to the text for the
> > multivalued field (so the analyzer would know which stemmer to use, as
> > they were being dynamically substituted in based upon the language).  The
> > code is available here if you are interested:
> > https://github.com/treygrainger/solr-in-
> >
> action/blob/master/src/main/java/sia/ch14/MultiTextFieldLanguageIdentifier
> > UpdateProcessor.java
> >
> > Good luck!
> >
> > -Trey
> >
> >
> >
> >
> > On Wed, Nov 27, 2013 at 10:16 AM, Müller, Stephan < Mueller@ponton-
> > consulting.de> wrote:
> >
> > > > I suspect that it is an oversight for a use case that was not
> > considered.
> > > > I mean, it should probably either ignore or convert non text/string
> > > > values.
> > > Ok, I'll see that I provide a patch against trunk. It actually ignores
> > > non string values, but is unable to check the remaining values of a
> > > multivalued field.
> > >
> > > > Hmmm... are you using JSON input? I mean, how are the types being
> set?
> > > > Solr XML doesn't have a way to set the value types.
> > > >
> > > No. It's a field with multivalued=true. That results in a
> > > SolrInputField where value (which is defined to be Object) actually
> > holds a List.
> > > This list is populated with Integer, String, Date, you name it.
> > > I'm talking about the actual Java-Datatypes. The values in the list
> > > are probably set by this 3rdparty Textbodyprocessor thingy.
> > >
> > > Now the Language processor just asks for field.getValue().
> > > This is delegated to the SolrInputField which in turn calls
> > > firstValue() Interestingly enough, already is able to handle a
> > Collection as its value.
> > > But if the value is a collection, it just returns the first element.
> > >
> > > > You could workaround it with an update processor that copied the
> > > > field
> > > and
> > > > massaged the multiple values into what you really want the language
> > > > detection to see. You could even implement that processor as a
> > > > JavaScript script with the stateless script update processor.
> > > >
> > > Our workaround would be to not feed the multivalued field but only the
> > > String fields (which are also included in the multivalued field)
> > >
> > >
> > > Filing a Bug/Feature request and providing the patch will take some
> > > time as I haven't setup a fully working trunk in my IDEA installation.
> > > But I'm eager to do it :)
> > >
> > > Regards,
> > > Stephan
> > >
> > >
> > > > -- Jack Krupansky
> > > >
> > > > -Original Message-
> > > > From: Müller, Stephan
> > > > Sent: Wednesday, November 27, 2013 5:02 AM
> > > > To: solr-user@lucene.apache.org
> > > > Subject: LanguageIdentifierUpdateProcessor uses only firstValue() on
> > > > multivalued fields
> > > >
> > > > Hello,
> > > >
> > > > this is a repost. This message was originally posted on the 'general'
> > > list
> > > > but it was suggested, that the 'user' list might be a better place
> > > > to
> > > ask.
> > > >
> > > >  Original Message 
> > > > Hi,
> > > >
> > > > we are passing a multivalued field to the
> > > > LanguageIdentifierUpdateProcessor.
> > > > This multival

Re: Single multilingual field analyzed based on other field values

2013-12-19 Thread Trey Grainger
Hi Dave,

Sorry for the delayed reply.  Did you end up trying the (scary) caching
idea?

Yeah, there's no reasonable way today to access data from other fields from
the document in the analyzers.  Creating an update request processor which
pulls the data prior to the field-by-field analysis and injects it (in some
format) into the field that needs the data pulled from other fields is how
to do this today.

In my examples, I only inserted a prefix prior to the entire field (i.e.
en,es|hables espanol is what she asks), but if you need something more
complicated to identify specific sections of the field to use different
analyzers then you could pull that off, as well.  For example:
[langs="en"]hello world
[langs="en,es"]hables espanol is what she asks.[
autodetectOtherLangs="true" fallbackLangs="en"]some unknown language text
for identification

Then, you would just have the analyzer for the field parse the content,
pass each chunk of text into the appropriate analyzer, and then modify the
term positions and offsets as necessary.  My example in chapter 14 of Solr
in Action assumed you would be using the same languages throughout the
whole field, but it would just require a little bit of pre-parsing work to
direct the use of specific analyzers only for specific parts of the content.

Frankly, I'm not sure pulling the data from another field (particularly if
you want different sections processed with different languages) is going to
be much simpler than putting it all into the field to be analyzed to begin
with (or better yet having an update request processor do it for you -
including the detection of language boundaries - inside of Solr so the
customer doesn't have to worry about it).

-Trey


On Tue, Oct 29, 2013 at 12:18 PM, davetroiano wrote:

> Hi Trey,
>
> I was reading v9 of the Solr in Action MEAP but browsing your github repo,
> so I think I'm looking at the latest stuff.
>
> Agreed that the thread caching idea is dangerous.  Perhaps it would work
> now, but it could easily break in a later version of Solr.
>
> I didn't mention another reason why I'd like to analyze based on other
> field
> values, which is that I'd like the ability to run analyzers on sub-sections
> of the MultiTextField.  e.g., given a multilingual document, run my
> text_english analyzer on the first half of a document and my text_french
> analyzer on the second half.  Of course, I could extend the prepend
> approach
> to take start and end offsets (e.g.,  name="myField">[en_0_1000,fr_1001_2500|]blah, blah, ...), but if it
> were possible I'd rather grab that data from another field and simplify the
> tokenizer (in terms of the string manipulation and having to adjust
> position
> offsets to ignore the prepended data... though you've already done the
> tricky part).
>
> Based on what I'm seeing on the message boards and JIRA (e.g., SOLR-1536 /
> SOLR-1327 not being fixed), it seems like there isn't a clean way to run
> analyzers dynamically based on data in other field(s).  If I end up trying
> the caching idea, I'll report my findings here.
>
> Thanks,
> Dave
>
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Single-multilingual-field-analyzed-based-on-other-field-values-tp4098141p4098242.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>


Re: Facet pivot and distributed search

2014-02-07 Thread Trey Grainger
FYI, the last distributed pivot facet patch functionally works, but there
are some sub-optimal data structures being used and some unnecessary
duplicate processing of values. As a result, we found that for certain
worst-case scenarios (i.e. data is not randomly distributed across Solr
cores and requires significant refinement) pivot facets with multiple
levels could take over a minute to aggregate and process results. This was
using a dataset of several hundred million documents and dozens of pivot
facets across 120 Solr cores distributed over 20 servers, so it is a more
extreme use-case than most will encounter.

Nevertheless, we've refactored the code and data structures and brought the
processing time from over a minute down to less than a second using the
above configuration. We plan to post the patch within the next week.


On Fri, Feb 7, 2014 at 3:08 AM, Geert Van Huychem wrote:

> Thx!
>
> Geert Van Huychem
> IT Consultant
> iFrameWorx BVBA
>
> Mobile: +32 497 27 69 03
> E-mail: ge...@iframeworx.be
> Site: http://www.iframeworx.be
> LinkedIn: http://www.linkedin.com/in/geertvanhuychem
>
>
> On Fri, Feb 7, 2014 at 8:55 AM, Shalin Shekhar Mangar <
> shalinman...@gmail.com> wrote:
>
> > Yes this is a open issue.
> >
> > https://issues.apache.org/jira/browse/SOLR-2894
> >
> > On Fri, Feb 7, 2014 at 1:13 PM, Geert Van Huychem 
> > wrote:
> > > Hi
> > >
> > > I'm using Solr 4.5 in a multi-core environment.
> > >
> > > I've setup
> > > - one core per documenttype: text, rss, tweet and external documents.
> > > - one distrib core which basically distributes the query to the 4 cores
> > > mentioned hereabove.
> > >
> > > Facet pivot works on each core individually, but when I send the exact
> > same
> > > query to the distrib core, I get no results.
> > >
> > > Anyone? Bug? Open issue?
> > >
> > > Best
> > >
> > > Geert Van Huychem
> >
> >
> >
> > --
> > Regards,
> > Shalin Shekhar Mangar.
> >
>


Re: How to implement multilingual word components fields schema?

2014-09-08 Thread Trey Grainger
Hi Ilia,

When writing *Solr in Action*, I implemented a feature which can do what
you're asking (allow multiple, dynamic analyzers to be used in a single
text field). This would allow you to use the same field and dynamically
change the analyzers (for example, you could do language-identification on
documents and only stem to the identified languages). It also supports more
than one Analyzer per field (i.e. if you have single documents or queries
containing multiple languages).

This seems to be a feature request which comes up regularly, so I just
submitted a new feature request on JIRA to add this feature to Solr and
track the progress:
https://issues.apache.org/jira/browse/SOLR-6492

I included a comment showing how to use the functionality currently
described in *Solr in Action*, but I plan to make it easier to use over the
next 2 months before calling it done. I'm going to be talking about
multilingual search in November at Lucene/Solr Revolution, so I'd ideally
like to finish before then so I can demonstrate it there.

Thanks,

-Trey Grainger
Director of Engineering, Search & Analytics @ CareerBuilder


On Mon, Sep 8, 2014 at 3:31 PM, Jorge Luis Betancourt Gonzalez <
jlbetanco...@uci.cu> wrote:

> In one of the talks by Trey Grainger (author of Solr in Action) he touches
> on how CareerBuilder is dealing with multilingual content with payloads; it's a
> little more work but I think it would pay off.
>
> On Sep 8, 2014, at 7:58 AM, Jack Krupansky 
> wrote:
>
> > You also need to take a stance as to whether you wish to auto-detect the
> language at query time vs. have a UI selection of language vs. attempt to
> perform the same query for each available language and then "determine"
> which has the best "relevancy". The latter two options are very sensitive
> to short queries. Keep in mind that auto-detection for indexing full
> documents is a different problem than auto-detection for very short queries.
> >
> > -- Jack Krupansky
> >
> > -Original Message- From: Ilia Sretenskii
> > Sent: Sunday, September 7, 2014 10:33 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: How to implement multilingual word components fields schema?
> >
> > Thank you for the replies, guys!
> >
> > Using field-per-language approach for multilingual content is the last
> > thing I would try since my actual task is to implement a search
> > functionality which would implement relatively the same possibilities for
> > every known world language.
> > The closest references are those popular web search engines, they seem to
> > serve worldwide users with their different languages and even
> > cross-language queries as well.
> > Thus, a field-per-language approach would be a sure waste of storage
> > resources due to the high number of duplicates, since there are over 200
> > known languages.
> > I really would like to keep single field for cross-language searchable
> text
> > content, without splitting it into specific language fields or specific
> > language cores.
> >
> > So my current choice will be to stay with just the ICUTokenizer and
> > ICUFoldingFilter as they are without any language specific
> > stemmers/lemmatizers yet at all.
> >
> > Probably I will put the most popular languages stop words filters and
> > stemmers into the same one searchable text field to give it a try and see
> > if it works correctly in a stack.
> > Does stacking language-specific filters like that work correctly in one
> field?
> >
> > Further development will most likely involve some advanced custom
> analyzers
> > like the "SimplePolyGlotStemmingTokenFilter" to utilize the ICU generated
> > ScriptAttribute.
> > http://comments.gmane.org/gmane.comp.jakarta.lucene.solr.user/100236
> >
> https://github.com/whateverdood/cross-lingual-search/blob/master/src/main/java/org/apache/lucene/sandbox/analysis/polyglot/SimplePolyGlotStemmingTokenFilter.java
> >
> > So I would like to know more about those "academic papers on this issue
> of
> > how best to deal with mixed language/mixed script queries and documents".
> > Tom, could you please share them?
>
> Concurso "Mi selfie por los 5". Detalles en
> http://justiciaparaloscinco.wordpress.com
>


What's the most efficient way to sort by "number of terms matched"?

2014-11-05 Thread Trey Grainger
Just curious if there are some suggestions here. The use case is fairly
simple:

Given a query like  python OR solr OR hadoop, I want to sort results by
"number of keywords matched" first, and by relevancy separately.

I can think of ways to do this, but not efficiently. For example, I could
do:
q=python OR solr OR hadoop&
  p1=python&
  p2=solr&
  p3=hadoop&
  sort=sum(if(query($p1,0),1,0),if(query($p2,0),1,0),if(query($p3,0),1,0))
desc, score desc

Other than the obvious downside that this requires me to pre-parse the
user's query, it's also somewhat inefficient to run the query function once
for each term in the original query since it is re-executing multiple
queries and looping through every document in the index during scoring.

Ideally, I would be able to do something like the below that could just
pull the count of unique matched terms from the main query (q parameter)
execution:
q=python OR solr OR hadoop&sort=uniquematchedterms() desc,score desc.

I don't think anything like this exists, but would love some suggestions if
anyone else has solved this before.

Thanks,

-Trey


Re: Need help understanding the use cases behind core auto-discovery

2013-09-21 Thread Trey Grainger
While on this topic...

Is it still true in Solr 4.5 (RC) that it is not possible to have a shared
config directory?  In general, I like the new core.properties mechanism
better as it removes the unnecessary centralized configuration of cores in
solr.xml, but I have an infrastructure where I have thousands of Solr Cores
with the same configs on a single server, and as last I could tell with
Solr 4.4 the only way to support this in core.properties was to copy and
paste or create symbolic links for the whole conf/ folder for every core
(i.e. thousands of identical copies of all config files in my case).

In the old solr.xml format, we could set the instanceDir to have all cores
reference the same folder, but in core.properties there doesn't seem to be
anything like this.  I tried just referencing solrconfig.xml in another
directory, but because everything is now relative to the conf/ directory
under the folder containing core.properties, none of the referenced files
were in the right place.

Is there any better guidance on migrating to core autodiscovery with the
need for a shared config directory (non-SolrCloud mode)?  This looked
promising, but it sounds dead from Erick's JIRA comment:
https://issues.apache.org/jira/browse/SOLR-4478

Thanks,

-Trey


On Sat, Sep 21, 2013 at 2:25 PM, Erick Erickson wrote:

> Also consider where SolrCloud is going. Trying to correctly maintain
> all the solr.xml files yourself on all the nodes would have
> been..."interesting". On all the machines in your 200 node cluster.
> With 17 different collections. With nodes coming and going. With
> splitting shards. With.
>
> Collections are almost guaranteed to be distributed unevenly (e.g. a
> big collection might have 20 shards and a small collection 3 in the
> same cluster). So each node used to require solr.xml to be unique as
> far as everything in the <cores> tag. But everything _not_ in the
> <cores> tag is common. Say you wanted to change the
> shardHandlerFactory (or any other setting we put in solr.xml that
> wouldn't have gone into the old <cores> tag). In the old-style way of
> doing things, since each solr.xml file on each node has potentially a
> different set of cores, you'd have to edit each and every one of them.
>
> The older way of doing this is fine as long as each solr.xml on each
> machine is self-consistent. So auto-discovery essentially automates
> that self-consistency.
>
> It also makes it possible to have Zookeeper manage your solr.xml and
> auto-distribute it to new nodes (or update existing) which would have
> taken a lot of effort to get right without auto-discovery. So changing
> the  consists of changing the solr.xml file and
> pushing it to ZooKeeper (don't quite remember the right JIRA, but you
> can do this now).
>
> I suppose it's like all other refactorings. Solr.xml had its origin
> in the single-core days, then when multi-cores came into being it was
> expanded to include that information, but eventually became, as Yonik
> says, unnecessary central configuration which started becoming a
> limitation.
>
> FWIW,
> Erick
>
> On Fri, Sep 20, 2013 at 9:45 AM, Timothy Potter 
> wrote:
> > Exactly the insight I was looking for! Thanks Yonik ;-)
> >
> >
> > On Fri, Sep 20, 2013 at 10:37 AM, Yonik Seeley 
> wrote:
> >
> >> On Fri, Sep 20, 2013 at 11:56 AM, Timothy Potter 
> >> wrote:
> >> > Trying to add some information about core.properties and
> auto-discovery
> >> in
> >> > Solr in Action and am at a loss for what to tell the reader is the
> >> purpose
> >> > of this feature.
> >>
> >> IMO, it was more a removal of unnecessary central configuration.
> >> You previously had to list the core in solr.xml, and now you don't.
> >> Cores should be fully self-describing so that it should be easy to
> >> move them in the future just by moving the core directory (although
> >> that may not yet work...)
> >>
> >> -Yonik
> >> http://lucidworks.com
> >>
> >> > Can anyone point me to any background information about core
> >> > auto-discovery? I'm not interested in the technical implementation
> >> details.
> >> > Mainly I'm trying to understand the motivation behind having this
> feature
> >> > as it seems unnecessary with the Core Admin API. Best I can tell is it
> >> > removes a manual step of firing off a call to the Core Admin API or
> >> loading
> >> > a core from the Admin UI. If that's it and I'm overthinking it, then
> cool
> >> > but was expecting more of an "ah-ha" moment with this feature ;-)
> >> >
> >> > Any insights you can share are appreciated.
> >> >
> >> > Thanks.
> >> > Tim
> >>
>


Re: Getting a query parameter in a TokenFilter

2013-09-22 Thread Trey Grainger
Hi Isaac,

In the process of writing Solr in Action (http://solrinaction.com), I have
built the solution to SOLR-5053 for the multilingual search chapter (I
didn't realize this ticket existed at the time).  The solution was
something I called a "MultiTextField".  Essentially, the field lets you
map a list of defined prefixes to field types and dynamically substitute
in one or more field types based upon the incoming content.

For example:

#schema.xml#
[the MultiTextField fieldType and field definitions were stripped from the
archived message]

#document#

  1
  en,es|the schools, la escuala


#Outputted Token Stream#:
[Position 1]   [Position 2]   [Position 3]   [Position 4]
the            school         la             escuela
               schools                       escuel

#query on two languages#
q=en,es|la OR en,es|escuela

 Essentially, this MultiText field type lets you dynamically combine one or
more Analyzers (from a defined field type) and stack the tokens based upon
term positions within each independent Analyzer.  The use case here was
handling multiple languages within the same field (and query).

To answer your original question... at query time, this implementation
requires that you pass the prefix before EACH term in the query, not just
the first term (you can see this in the "q=" I demonstrated above).  If you
have a Token Filter you have developed, you "could" probably accomplish
what you are trying to do the same way.

You could write a custom QParserPlugin that would do this for you I think.
 Alternatively, it may be possible to create a similar implementation that
makes use of a dynamic field name (i.e.  "content|en,fr" as the field
name), which would pull the prefix from the field name and apply it to all
tokens instead of requiring/allowing each token to specify its own prefix.
 I haven't done this in my implementation, but I could see where it might
be more user-friendly for many Solr users.

I'm just finishing up the "multilingual search" chapter and code now and
will be happy to post it to SOLR-5053 once I finish in the next few days if
this would be helpful to you.

-Trey


On Sat, Sep 21, 2013 at 4:15 PM, Isaac Hebsh  wrote:

> Thought about that again,
> We can do this work as a search component, manipulating the query string.
> The cons are the double QParser work, and the double tokenization work.
>
> Another approach which might solve this issue easily is "Dynamic query
> analyze chain": https://issues.apache.org/jira/browse/SOLR-5053
>
> What would you do?
>
>
> On Tue, Sep 17, 2013 at 10:31 PM, Isaac Hebsh 
> wrote:
>
> > Hi everyone,
> >
> > We developed a TokenFilter.
> > It should act differently, depends on a parameter supplied in the
> > query (for query chain only, not the index one, of course).
> > We found no way to pass that parameter into the TokenFilter flow. I guess
> > that the root cause is because TokenFilter is a pure lucene object.
> >
> > As a last resort, we tried to pass the parameter as the first term in the
> > query text (q=...), and save it as a member of the TokenFilter instance.
> >
> > Although it is ugly, it might work fine.
> > But, the problem is that it is not guaranteed that all the terms of a
> > particular query will be analyzed by the same instance of a TokenFilter.
> In
> > this case, some terms will be analyzed without the required information
> of
> > that "parameter". We can produce such a race very easily.
> >
> > How should I overcome this issue?
> > Do anyone have a better resolution?
> >
>


Re: [ANNOUNCE] Solr wiki editing change

2013-03-30 Thread Trey Grainger
Please add TreyGrainger to the contributors group.  Thanks!

-Trey


On Sun, Mar 24, 2013 at 11:18 PM, Steve Rowe  wrote:

> The wiki at http://wiki.apache.org/solr/ has come under attack by
> spammers more frequently of late, so the PMC has decided to lock it down in
> an attempt to reduce the work involved in tracking and removing spam.
>
> From now on, only people who appear on
> http://wiki.apache.org/solr/ContributorsGroup will be able to
> create/modify/delete wiki pages.
>
> Please request either on the solr-user@lucene.apache.org or on
> d...@lucene.apache.org to have your wiki username added to the
> ContributorsGroup page - this is a one-time step.
>
> Steve
> -
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>


Solr 3.1: Old Index Files Not Removed on Optimize?

2011-04-15 Thread Trey Grainger
I was just hoping someone might be able to point me in the right direction
here.  We just upgraded from Solr 1.4 to Solr 3.1 this past week and we're
having issues running out of disk space on our Master servers.  Our Master
has dozens of cores.  We have a script that kicks off once per day to do a
rolling optimize.  The script optimizes a single core, waits 5 minutes to
give the server some breathing room to catch up on indexing in a non-i/o
intensive state, and then moves onto the next core (repeating until done).

The problem we are facing is that under Solr 1.4, the old index files were
deleted very quickly after each optimize, but under Solr 3.1, the old index
files hang around for hours... in many cases they don't disappear until we
restart Solr completely.  This is leading to us running out of disk space,
as each core's index doubles in size during the optimize process and stays
that way until the next solr restart.

I was just wondering if anyone could point me to some specific changes or
settings which may be leading to the difference between solr versions (or
any other environmental issues you may know about).  I see several tickets
in Jira about similar issues, but they mostly appear to have been resolved
in the past.

Has anyone else seen this behavior under Solr 3.1, or do you think we may be
missing some kind of new configuration setting?

For reference, we are running on 64bit RedHat Linux.  This is what I have
right now: [From SolrConfig.xml]:
true



commit
optimize
startup



  

  10
  30

  


  false
  1



Thanks in advance,

-Trey


Re: Solr 3.1: Old Index Files Not Removed on Optimize?

2011-04-15 Thread Trey Grainger
Thank you, Yonik!

I see the Jira issue you created and am guessing it's due to this issue.
 We're going to remove replicateAfter="startup" in the mean-time to see if
that helps (assuming this is the issue the jira ticket described).

I appreciate you taking a look at this.

Thanks

-Trey


On Fri, Apr 15, 2011 at 2:58 PM, Yonik Seeley wrote:

> I can reproduce this with the example server w/ your deletionPolicy
> and replicationHandler configs.
> I'll dig further to see what's behind this behavior.
>
> -Yonik
> http://www.lucenerevolution.org -- Lucene/Solr User Conference, May
> 25-26, San Francisco
>
> On Fri, Apr 15, 2011 at 1:14 PM, Trey Grainger  wrote:
> > I was just hoping someone might be able to point me in the right
> direction
> > here.  We just upgraded from Solr 1.4 to Solr 3.1 this past week and
> we're
> > having issues running out of disk space on our Master servers.  Our
> Master
> > has dozens of cores.  We have a script that kicks off once per day to do
> a
> > rolling optimize.  The script optimizes a single core, waits 5 minutes to
> > give the server some breathing room to catch up on indexing in a non-i/o
> > intensive state, and then moves onto the next core (repeating until
> done).
> >
> > The problem we are facing is that under Solr 1.4, the old index files
> were
> > deleted very quickly after each optimize, but under Solr 3.1, the old
> index
> > files hang around for hours... in many cases they don't disappear until
> we
> > restart Solr completely.  This is leading to us running out of disk
> space,
> > as each core's index doubles in size during the optimize process and
> stays
> > that way until the next solr restart.
> >
> > I was just wondering if anyone could point me to some specific changes or
> > settings which may be leading to the difference between solr versions (or
> > any other environmental issues you may know about).  I see several
> tickets
> > in Jira about similar issues, but they mostly appear to have been
> resolved
> > in the past.
> >
> > Has anyone else seen this behavior under Solr 3.1, or do you think we may
> be
> > missing some kind of new configuration setting?
> >
> > For reference, we are running on 64bit RedHat Linux.  This is what I have
> > right now: [From SolrConfig.xml]:
> > true
> >
> > 
> >
> >commit
> >optimize
> >startup
> >
> > 
> >
> >  
> >
> >  10
> >  30
> >
> >  
> >
> >
> >  false
> >  1
> >
> >
> >
> > Thanks in advance,
> >
> > -Trey
> >
>


Apache Spam Filter Blocking Messages

2011-04-20 Thread Trey Grainger
Hey (solr-user) Mailing list admins,

I've tried replying to a thread multiple times tonight, and keep getting a
bounce-back with this response:
Technical details of permanent failure:
Google tried to deliver your message, but it was rejected by the recipient
domain. We recommend contacting the other email provider for further
information about the cause of this error. The error that the other server
returned was: 552 552 spam score (5.1) exceeded threshold
(FREEMAIL_FROM,FS_REPLICA,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,RFC_ABUSE_POST,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL
(state 18).

Apparently I sound like spam when I write perfectly good English and include
some xml and a link to a jira ticket in my e-mail (I tried a couple
different variations).  Anyone know a way around this filter, or should I
just respond to those involved in the e-mail chain directly and avoid the
mailing list?

Thanks,

-Trey


Re: Apache Spam Filter Blocking Messages

2011-04-21 Thread Trey Grainger
Good to know; I'll go change those settings, then.  Thanks for the feedback.

-Trey


On Thu, Apr 21, 2011 at 4:42 AM, Em  wrote:
>
> This really helps at the mailinglists.
> If you send your mails with Thunderbird, be sure to check that you enforce
> plain-text-emails. If not, it will often send HTML-mails.
>
> Regards,
> Em
>
>
> Marvin Humphrey wrote:
> >
> > On Thu, Apr 21, 2011 at 12:30:29AM -0400, Trey Grainger wrote:
> >> (FREEMAIL_FROM,FS_REPLICA,HTML_MESSAGE,RCVD_IN_DNSWL_LOW,RFC_ABUSE_POST,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL
> >                             
> > Note the "HTML_MESSAGE" in the list of things SpamAssassin didn't like.
> >
> >> Apparently I sound like spam when I write perfectly good English and
> >> include
> >> some xml and a link to a jira ticket in my e-mail (I tried a couple
> >> different variations).  Anyone know a way around this filter, or should I
> >> just respond to those involved in the e-mail chain directly and avoid the
> >> mailing list?
> >
> > Send plain text email instead of HTML.  That solves the problem 99% of the
> > time.
> >
> > Marvin Humphrey
> >
>
>
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Apache-Spam-Filter-Blocking-Messages-tp2845854p2846304.html
> Sent from the Solr - User mailing list archive at Nabble.com.


Re: old searchers not closing after optimize or replication

2011-04-21 Thread Trey Grainger
Hey Bernd,

Check out https://issues.apache.org/jira/browse/SOLR-2469.  There is a
pretty bad bug in Solr 3.1 which occurs if you have replicateAfter="startup" set in
your replication configuration in solrconfig.xml.  See the thread between Yonik and
myself from a few days ago titled "Solr 3.1: Old Index Files Not
Removed on Optimize".

You can disable startup replication and perform an optimize to see if
this fixes your problem of old index files being left behind (though
you may have some old index files left behind from before this change
that you still need to clean-up).  Yonik has already pushed up a patch
into the 3x branch and trunk for this issue.  I can confirm that
applying the patch (or just removing startup replication) resolved the
issue for us.

Do you think this is your issue?

Thanks,

-Trey



On Thu, Apr 21, 2011 at 2:27 AM, Bernd Fehling
 wrote:
> Hi Erik,
>
> 
> 1
> 0
> 
>
> Due to 44 minutes optimization time we do an optimization once a day
> during the night.
>
> I will try with a smaller index on my development system.
>
> Best regards,
> Bernd
>
>
> Am 20.04.2011 17:50, schrieb Erick Erickson:
>>
>> It looks OK, but still doesn't explain keeping the old files around. What
>> is
>> your deletionPolicy in your solrconfig.xml look like? It's
>> possible that you're seeing Solr attempt to keep around several
>> optimized copies of the index, but that still doesn't explain why
>> restarting Solr removes them unless the deletionPolicy gets invoked
>> sometime and your index files are aging out (I don't know the
>> internals of deletion well enough to say).
>>
>> About optimization. It's become less important with recent code. Once
>> upon a time, it made a substantial difference in search speed. More
>> recently, it has very little impact on search speed, and is used
>> much more sparingly. Its greatest benefit is reclaiming unused resources
>> left over from deleted documents. So you might want to avoid the pain
>> of optimizing (44 minutes!) and only optimize rarely or if you have
>> deleted a lot of documents.
>>
>> It might be worthwhile to try (with a smaller index !) a bunch of optimize
>> cycles and see if the deletionPolicy idea has any merit. I'd expect
>> your index to reach a maximum and stay there after the saved
>> copies of the index was reached...
>>
>> But otherwise I'm puzzled...
>>
>> Erick
>>
>> On Wed, Apr 20, 2011 at 10:30 AM, Bernd Fehling
>>   wrote:
>>>
>>> Hi Erik,
>>>
>>> Am 20.04.2011 15:42, schrieb Erick Erickson:

 H, this isn't right. You've pretty much eliminated the obvious
 things. What does lsof show? I'm assuming it shows the files are
 being held open by your Solr instance, but it's worth checking.
>>>
>>> Just commited new content 3 times and finally optimized.
>>> Again having old index files left.
>>>
>>> Then checked on my master, only the newest version of index files are
>>> listed with lsof. No file handles to the old index files but the
>>> old index files remain in data/index/.
>>> Thats strange.
>>>
>>> This time replication worked fine and cleaned up old index on slaves.
>>>

 I'm not getting the same behavior, admittedly on a Windows box.
 The only other thing I can think of is that you have a query that's
 somehow never ending, but that's grasping at straws.

 Do your log files show anything interesting?
>>>
>>> Lets see:
>>> - it has the old generation (generation=12) and its files
>>> - and recognizes that there have been several commits (generation=18)
>>>
>>> 20.04.2011 14:05:26 org.apache.solr.update.DirectUpdateHandler2 commit
>>> INFO: start
>>>
>>> commit(optimize=true,waitFlush=false,waitSearcher=true,expungeDeletes=false)
>>> 20.04.2011 14:05:26 org.apache.solr.core.SolrDeletionPolicy onInit
>>> INFO: SolrDeletionPolicy.onInit: commits:num=2
>>>
>>>
>>>  commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_c,version=1302159868435,generation=12,filenames=[_3xm.nrm,
>>> _3xm.fdx, segment
>>> s_c, _3xm.fnm, _3xm.fdt, _3xm.tis, _3xm.tii, _3xm.prx, _3xm.frq]
>>>
>>>
>>>  commit{dir=/srv/www/solr/solr/solrserver/solr/data/index,segFN=segments_i,version=1302159868447,generation=18,filenames=[_3xm.nrm,
>>> _3xo.tis, _3xp.pr
>>> x, _3xo.fnm, _3xp.fdx, _3xs.frq, _3xo.tii, _3xp.fdt, _3xn.tii, _3xm.fdx,
>>> _3xn.nrm, _3xm.fdt, _3xs.prx, _3xn.tis, _3xn.fdx, _3xr.nrm, _3xm.prx,
>>> _3xn.fdt, _3x
>>> p.tii, _3xs.nrm, _3xp.tis, _3xo.prx, segments_i, _3xm.tii, _3xq.tii,
>>> _3xs.fdx, _3xs.fdt, _3xo.frq, _3xn.prx, _3xm.tis, _3xr.prx, _3xq.tis,
>>> _3xo.fdt, _3xp.fr
>>> q, _3xq.fnm, _3xo.fdx, _3xp.fnm, _3xr.tis, _3xr.fnm, _3xq.frq, _3xr.tii,
>>> _3xr.frq, _3xo.nrm, _3xs.tii, _3xq.fdx, _3xq.fdt, _3xp.nrm, _3xq.prx,
>>> _3xs.tis, _3x
>>> m.frq, _3xr.fdx, _3xm.fnm, _3xn.frq, _3xq.nrm, _3xs.fnm, _3xn.fnm,
>>> _3xr.fdt]
>>> 20.04.2011 14:05:26 org.apache.solr.core.SolrDeletionPolicy updateCommits
>>> INFO: newest commit = 1302159868447
>>>
>>>
>>> - after 44 minutes of optimizing (over 140GB and 27.8 mio docs) it gets
>>>  the Sol

Re: Indexes in ramdisk don't show performance improvement?

2011-06-02 Thread Trey Grainger
Linux will cache the open index files in RAM (in the filesystem cache)
after their first read, which makes the ram disk generally useless.
Unless you're processing other files on the box with a size greater
than your total unused ram (and thus need to micro-manage what stays
in RAM), I wouldn't recommend using a ramdisk - it's just more to
manage.  If you reboot the box and run a few searches, those first few
will likely be slower until all the index files are cached in memory.
After that point, the performance should be comparable because all
files are read out of RAM from that point forward.

If solr caches are enabled and your queries are repetitive then that
could also be contributing to the speed of repetitive queries.  Note
that the above advice assumes your total unused ram (not allocated to
the JVM or any other processes) is greater than the size of your
lucene index files, which should be a safe assumption considering
you're trying to put the whole index in a ramdisk.

-Trey


On Thu, Jun 2, 2011 at 7:15 PM, Erick Erickson  wrote:
> What I expect is happening is that the Solr caches are effectively making the
> two tests identical, using memory to hold the vital parts of the code in both
> cases (after disk warming on the instance using the local disk). I suspect if
> you measured the first few queries (assuming no auto-warming) you'd see the
> local disk version be slower.
>
> Were you running these tests for curiosity or is running from /dev/shm 
> something
> you're considering for production?
>
> Best
> Erick
>
> On Thu, Jun 2, 2011 at 5:47 PM, Parker Johnson  wrote:
>>
>> Hey everyone.
>>
>> Been doing some load testing over the past few days. I've been throwing a
>> good bit of load at an instance of solr and have been measuring response
>> time.  We're running a variety of different keyword searches to keep
>> solr's cache on its toes.
>>
>> I'm running two exact same load testing scenarios: one with indexes
>> residing in /dev/shm and another from local disk.  The indexes are about
>> 4.5GB in size.
>>
>> On both tests the response times are the same.  I wasn't expecting that.
>> I do see the java heap size grow when indexes are served from disk (which
>> is expected).  When the indexes are served out of /dev/shm, the java heap
>> stays small.
>>
>> So in general is this consistent behavior?  I don't really see the
>> advantage of serving indexes from /dev/shm.  When the indexes are being
>> served out of ramdisk, is the linux kernel or the memory mapper doing
>> something tricky behind the scenes to use ramdisk in lieu of the java heap?
>>
>> For what it is worth, we are running x_64 rh5.4 on a 12 core 2.27Ghz Xeon
>> system with 48GB ram.
>>
>> Thoughts?
>>
>> -Park
>>
>>
>>
>


Re: Can I invert the inverted index?

2011-07-05 Thread Trey Grainger
Gabriele,

I created a patch that does this about a year ago.  See
https://issues.apache.org/jira/browse/SOLR-1837.  It was written for Solr
1.4 and is based upon the Document Reconstructor in Luke.  The patch adds a
link to the main solr admin page to a docinspector page which will
reconstruct the document given a uniqueid (required).  Keep in mind that
you're only looking at what's "in" the index for non-stored fields, not the
original text.

If you have any issues using this on the most recent release, let me know
and I'd be happy to create a new patch for solr 3.3.  One of these days I'll
remove the JSP dependency and this may eventually make it into trunk.

Thanks,

-Trey Grainger
Search Technology Development Team Lead, Careerbuilder.com
Site Architect, Celiaccess.com


On Tue, Jul 5, 2011 at 3:59 PM, Gabriele Kahlout
wrote:

> Hello,
>
> With an inverted index the term is the key, and the documents are the
> values. Is it still however possible that given a document id I get the
> terms indexed for that document?
>
> --
> Regards,
> K. Gabriele
>
> --- unchanged since 20/9/10 ---
> P.S. If the subject contains "[LON]" or the addressee acknowledges the
> receipt within 48 hours then I don't resend the email.
> subject(this) ∈ L(LON*) ∨ ∃x. (x ∈ MyInbox ∧ Acknowledges(x, this) ∧
> time(x)
> < Now + 48h) ⇒ ¬resend(I, this).
>
> If an email is sent by a sender that is not a trusted contact or the email
> does not contain a valid code then the email is not received. A valid code
> starts with a hyphen and ends with "X".
> ∀x. x ∈ MyInbox ⇒ from(x) ∈ MySafeSenderList ∨ (∃y. y ∈ subject(x) ∧ y ∈
> L(-[a-z]+[0-9]X)).
>


Re: resetting stats

2010-03-30 Thread Trey Grainger
You can reload the core on which you want to reset the stats - this lets you
keep the engine up and running without requiring you to restart Solr.  If you
have a separate core for aggregating (i.e. a core that contains no data and
has no caches) then the overhead for reloading that core is negligible and
the time to reload is essentially zero.
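
For example, a core can be reloaded in place through the CoreAdmin API with a
request like this (core name and port are just examples):

http://localhost:8983/solr/admin/cores?action=RELOAD&core=core0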

The primary disadvantage of the core reloading approach is that your warmed
caches are dropped (if you are using caches on that core), but as long as
you have good warmup queries you should be okay, provided the reload isn't
constant.

-Trey

On Tue, Mar 30, 2010 at 8:10 PM, Chris Hostetter
wrote:

>
> : Is there a way to reset the stats counters? For example in the Query
> handler
> : avgTimePerRequest is not much use after a while as it is an avg since the
> : server started.
>
> not at the moment ... but it would probably be fairly straight forward to
> add as a new option if you want to file a Jira isssue (and maybe take a
> crak at a patch)
>
>
>
> -Hoss
>
>


Re: resetting stats

2010-03-31 Thread Trey Grainger
: reloading the core just to reset the stats definitely seems like throwing
: out the baby with the bathwater.

Agreed about throwing out the baby with the bath water - if stats need to be
reset, though, then that's the only way today.  A reset stats button would
be a nice way to prevent having to do this.

: Huh? ... how would having an extra core (with no data) help you with
: getting aggregate stats from your request handlers?

Say I have 3 Cores named core0, core1, and core2, where only core1 and core2
have documents and caches.  If all my searches hit core0, and core0 shards
out to core1 and core2, then the stats from core0 would be accurate for
errors, timeouts, totalTime, avgTimePerRequest, avgRequestsPerSecond, etc.
Obviously this is based upon the following two assumptions: 1) The request
handlers you are using/monitoring are distributed aware, and 2) you are
using distributed search and all your queries are going to an aggregating
core.
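
For example (host, port, and core names are just placeholders):

http://localhost:8983/solr/core0/select?q=*:*&shards=localhost:8983/solr/core1,localhost:8983/solr/core2

All of the request handler stats accumulate on core0, while the actual
searching happens on core1 and core2.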

I'm not suggesting that anyone needs a setup like this, just pointing out
that this type of setup somewhat avoids throwing the baby out with the bath
water by not putting a baby in the bath water that is going to be thrown out
(core0).


On Wed, Mar 31, 2010 at 6:40 PM, Chris Hostetter
wrote:

>
> : You can reload the core on which you want to reset the stats - this lets
> you
> : keep the engine up and running without requiring you restart Solr.  If
> you
>
> reloading the core just to reset the stats definitely seems like throwing
> out the baby with the bathwater.
>
> : have an separate core for aggregating (i.e. a core that contains no data
> and
> : has no caches) then the overhead for reloading that core is negligable
> and
> : the time to reload is essentially zero.
>
> Huh? ... how would having an extra core (with no data) help you with
> getting aggregate stats from your request handlers?  If you want to know
> the avgTimePerRequest from handlerA, that number isn't going to be useful
> if it comes from a core that isn't what your users are querying
> against
>
> : > : Is there a way to reset the stats counters? For example in the Query
> : > handler
> : > : avgTimePerRequest is not much use after a while as it is an avg since
> the
> : > : server started.
>
>
> -Hoss
>
>


Re: Luke browser does not show non-String Solr fields?

2010-05-31 Thread Trey Grainger
I submitted a patch a few months back for a Solr Document Inspector which
allows one to see the indexed values for any document in a Solr index (
https://issues.apache.org/jira/browse/SOLR-1837). This is more or less a
port of Luke's DocumentReconstructor into Solr, but the tool additionally
has access to all the solr schema/field type information for display
purposes (i.e. Trie Fields are human-readable).

This won't help you search for values in an index or inspect anything at a
macro level (i.e. term counts across the index), but there are other tools
in Solr for that.  Given a UniqueID, however, you can view all the indexed
values for each field in that particular document.  You can always do a
search within Solr for the values you are looking for and then use this tool
to view the indexed values for any documents which match.

This may or may not help you (I can't tell what problem you are trying to
solve), but I thought it would be worth mentioning as one tool in your
toolbox.
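
(As a rough, built-in alternative for per-document inspection, the Luke
request handler can also be pointed at a single document by its uniqueKey
value. A sketch, assuming a single-core setup with the admin handlers
registered and a uniqueKey value of SOME_DOC_ID:

  http://localhost:8983/solr/admin/luke?id=SOME_DOC_ID

It gives a more limited view than the Document Inspector, but can be a quick
sanity check.)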

-Trey



Re: IRA or IRA the Person

2019-04-01 Thread Trey Grainger
Hi Brett,

There are a couple of angles you can take here. If you are only concerned
about this specific term or a small number of other known terms like "IRA"
and want to spot fix it, you can use something like the query elevation
component in Solr (
https://lucene.apache.org/solr/guide/7_7/the-query-elevation-component.html)
to explicitly include or exclude documents.
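
For example, assuming the Query Elevation Component is configured in
solrconfig.xml, a minimal elevate.xml entry for this case (document ids are
just placeholders) might pin the IRA article and exclude the advisor page
for the query "ira":

  <elevate>
    <query text="ira">
      <doc id="article-individual-retirement-accounts" />
      <doc id="advisor-ira-black" exclude="true" />
    </query>
  </elevate>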

Otherwise, if you are looking for a more data-driven approach to solving
this, you can leverage the aggregate click-streams for your users across
all of the searches on your platform to boost documents that are more
popular for any given search. We do this in our commercial product
(Lucidworks Fusion) through our Signals Boosting feature, but you could
implement something similar yourself with some work, as the general
architecture is fairly well-documented here:
https://doc.lucidworks.com/fusion-ai/4.2/user-guide/signals/index.html
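
A home-grown version of this usually aggregates clicks per (query, document)
pair offline and then applies the top documents as boosts at query time.
A rough sketch of the query-time half (ids and weights are made up),
assuming the edismax parser:

  q=ira&defType=edismax&bq=id:("article-individual-retirement-accounts"^40 OR "advisor-ira-black"^5)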

If you do not have long-lived content OR you do not have sufficient
signals history, you could alternatively use something like Solr's Semantic
Knowledge Graph to automatically find the terms that are most related to
your query terms within your content. In that case, if the "individual
retirement account" meaning is more common across your documents, you'd
probably end up with terms related to that meaning, which could be used to
apply data-driven boosts on your query toward that concept (instead of the
person, in this case).
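
For reference, the Semantic Knowledge Graph is exposed through the JSON
Facet API's relatedness() aggregate (Solr 7.4+). A rough sketch of pulling
the terms most related to "ira" out of a hypothetical "body" field on a
collection named "content":

  curl http://localhost:8983/solr/content/query -d '
  {
    "query": "*:*",
    "params": { "fore": "body:ira", "back": "*:*" },
    "facet": {
      "related_terms": {
        "type": "terms",
        "field": "body",
        "limit": 10,
        "sort": { "r": "desc" },
        "facet": { "r": "relatedness($fore,$back)" }
      }
    }
  }'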

I gave a presentation at Activate ("the Search & AI Conference") last year
on some of the more data-driven approaches to parsing and understanding the
meaning of terms within queries, which included things like disambiguation
(similar to what you're doing here) and some additional approaches
leveraging a combination of query log mining, the semantic knowledge graph,
and the Solr Text Tagger. If you start handling these use cases in a more
systematic and data-driven way, you might want to check out some of the
techniques I mentioned there: Video:
https://www.youtube.com/watch?v=4fMZnunTRF8 | Slides:
https://www.slideshare.net/treygrainger/how-to-build-a-semantic-search-system


All the best,

Trey Grainger
Chief Algorithms Officer @ Lucidworks


On Mon, Apr 1, 2019 at 11:45 AM Moyer, Brett  wrote:

> Hello,
>
> Looking for ideas on how to determine intent and drive results to
> a person result or an article result. We are a financial institution and we
> have IRA's Individual Retirement Accounts and we have a page that talks
> about an Advisor, IRA Black.
>
> Our users are in a bad habit of only using single terms for
> search. A very common search term is "ira". The PERSON page ranks higher
> than the article on IRA's. With essentially no information from the user,
> what are some way we can detect and rank differently? Thanks!
>
> Brett Moyer
> *
> This e-mail may contain confidential or privileged information.
> If you are not the intended recipient, please notify the sender
> immediately and then delete it.
>
> TIAA
> *
>


Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Trey Grainger
Proposal:
"A Solr COLLECTION is composed of one or more SHARDS, which each have one
or more REPLICAS. Each replica can have a ROLE of either:
1) A LEADER, which can process external updates for the shard
2) A FOLLOWER, which receives updates from another replica"

(Note: I prefer "role" but if others think it's too overloaded due to the
overseer role, we could replace it with "mode" or something similar)
---

To be explicit with the above definitions:
1) In SolrCloud, the roles of leaders and followers can dynamically change
based upon the status of the cluster. In standalone mode, they can be
changed by manual intervention.
2) A leader does not have to have any followers (i.e. only one active
replica)
3) Each shard always has one leader.
4) A follower can also pull updates from another follower instead of a
leader (traditionally known as a REPEATER). A repeater is still a follower,
but would not be considered a leader because it can't process external
updates.
5) A replica cannot be both a leader and a follower.

In addition to the above roles, each replica can have a TYPE of one of:
1) NRT - which can serve in the role of leader or follower
2) TLOG - which can only serve in the role of follower
3) PULL - which can only serve in the role of follower

A replica's type may be changed automatically in the event that its role
changes.

I think this terminology is consistent with the current Leader/Follower
usage while also being able to easily accommodate a rename of the historical
master/slave terminology without mental gymnastics or the introduction of
more cognitive load through new terminology. I think adopting the
Primary/Replica terminology will be incredibly confusing given the already
specific and well-established meaning of "replica" within Solr.

All the Best,

Trey Grainger
Founder, Searchkernel
https://searchkernel.com



On Wed, Jun 17, 2020 at 3:38 PM Anshum Gupta  wrote:

> Hi everyone,
>
> Moving a conversation that was happening on the PMC list to the public
> forum. Most of the following is just me recapping the conversation that has
> happened so far.
>
> Some members of the community have been discussing getting rid of the
> master/slave nomenclature from Solr.
>
> While this may require a non-trivial effort, a general consensus so far
> seems to be to start this process and switch over incrementally, if a
> single change ends up being too big.
>
> There have been a lot of suggestions around what the new nomenclature might
> look like, a few people don’t want to overlap the naming here with what
> already exists in SolrCloud i.e. leader/follower.
>
> Primary/Replica was an option that was suggested based on what other
> vendors are moving towards based on Wikipedia:
> https://en.wikipedia.org/wiki/Master/slave_(technology)
> , however there were concerns around the use of “replica” as that denotes a
> very specific concept in SolrCloud. Current terminology clearly
> differentiates the use of the traditional replication model from SolrCloud
> and reusing the names would make it difficult for that to happen.
>
> There were similar concerns around using Leader/follower.
>
> Let’s continue this conversation here while making sure that we converge
> without much bike-shedding.
>
> -Anshum
>


Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Trey Grainger
I guess I don't see it as polysemous, but instead simplifying.

In my proposal, the terms "leader" and "follower" would have the exact same
meaning in both SolrCloud and standalone mode. The only difference would be
that SolrCloud automatically manages the leaders and followers, whereas in
standalone mode you have to manage them manually (as is the case with most
things in SolrCloud vs. Standalone).

My view is that having an entirely different set of terminology describing
the same thing is way more cognitive overhead than having consistent
terminology.

Trey Grainger
Founder, Searchkernel
https://searchkernel.com

On Wed, Jun 17, 2020 at 4:50 PM Walter Underwood 
wrote:

> I strongly disagree with using the Solr Cloud leader/follower terminology
> for non-Cloud clusters. People in my company are confused enough without
> using polysemous terminology.
>
> “This node is the leader, but it means something different than the leader
> in this other cluster.” I’m dreading that conversation.
>
> I like “principal”. How about “clone” for the slave role? That suggests
> that
> it does not accept updates and that it is loosely-coupled, only depending
> on the state of the no-longer-called-master.
>
> Chegg has five production Solr Cloud clusters and one production
> master/slave
> cluster, so this is not a hypothetical for us. We have 100+ Solr hosts in
> production.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Jun 17, 2020, at 1:36 PM, Trey Grainger  wrote:
> >
> > Proposal:
> > "A Solr COLLECTION is composed of one or more SHARDS, which each have one
> > or more REPLICAS. Each replica can have a ROLE of either:
> > 1) A LEADER, which can process external updates for the shard
> > 2) A FOLLOWER, which receives updates from another replica"
> >
> > (Note: I prefer "role" but if others think it's too overloaded due to the
> > overseer role, we could replace it with "mode" or something similar)
> > ---
> >
> > To be explicit with the above definitions:
> > 1) In SolrCloud, the roles of leaders and followers can dynamically
> change
> > based upon the status of the cluster. In standalone mode, they can be
> > changed by manual intervention.
> > 2) A leader does not have to have any followers (i.e. only one active
> > replica)
> > 3) Each shard always has one leader.
> > 4) A follower can also pull updates from another follower instead of a
> > leader (traditionally known as a REPEATER). A repeater is still a
> follower,
> > but would not be considered a leader because it can't process external
> > updates.
> > 5) A replica cannot be both a leader and a follower.
> >
> > In addition to the above roles, each replica can have a TYPE of one of:
> > 1) NRT - which can serve in the role of leader or follower
> > 2) TLOG - which can only serve in the role of follower
> > 3) PULL - which can only serve in the role of follower
> >
> > A replica's type may be changed automatically in the event that its role
> > changes.
> >
> > I think this terminology is consistent with the current Leader/Follower
> > usage while also being able to easily accomodate a rename of the
> historical
> > master/slave terminology without mental gymnastics or the introduction or
> > more cognitive load through new terminology. I think adopting the
> > Primary/Replica terminology will be incredibly confusing given the
> already
> > specific and well established meaning of "replica" within Solr.
> >
> > All the Best,
> >
> > Trey Grainger
> > Founder, Searchkernel
> > https://searchkernel.com
> >
> >
> >
> > On Wed, Jun 17, 2020 at 3:38 PM Anshum Gupta 
> wrote:
> >
> >> Hi everyone,
> >>
> >> Moving a conversation that was happening on the PMC list to the public
> >> forum. Most of the following is just me recapping the conversation that
> has
> >> happened so far.
> >>
> >> Some members of the community have been discussing getting rid of the
> >> master/slave nomenclature from Solr.
> >>
> >> While this may require a non-trivial effort, a general consensus so far
> >> seems to be to start this process and switch over incrementally, if a
> >> single change ends up being too big.
> >>
> >> There have been a lot of suggestions around what the new nomenclature
> might
> >> look like, a few people don’t want to overlap the naming here with what
> >> al

Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Trey Grainger
Hi Walter,

>In Solr Cloud, the leader knows about each follower and updates them.
Respectfully, I think you're mixing the "TYPE" of replica with the role of
the "leader" and "follower".

In SolrCloud, only if the TYPE of a follower is NRT or TLOG does the leader
push updates to those followers.

When the TYPE of a follower is PULL, then it does not.  In Standalone mode,
the type of a (currently) master would be NRT, and the type of the
(currently) slaves is always PULL.
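
For concreteness: in SolrCloud, the TYPE is what you specify when adding a
replica through the Collections API. For example, a PULL replica (which only
polls the leader's index) can be added like this, with the collection and
shard names as placeholders:

  http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=techproducts&shard=shard1&type=pull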

As such, this behavior is consistent across both SolrCloud and Standalone
mode. It is true that Standalone mode does not currently have support for
two of the replica TYPES that SolrCloud mode does, but I maintain that
leader vs. follower behavior is inconsistent here.

Trey Grainger
Founder, Searchkernel
https://searchkernel.com



On Wed, Jun 17, 2020 at 5:41 PM Walter Underwood 
wrote:

> But they are not the same. In Solr Cloud, the leader knows about each
> follower and updates them. In standalone, the master has no idea that
> slaves exist until a replication request arrives.
>
> In Solr Cloud, the leader is elected. In standalone, that role is fixed at
> config load time.
>
> Looking ahead in my email inbox, publisher/subscriber is an excellent
> choice.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
> > On Jun 17, 2020, at 2:21 PM, Trey Grainger  wrote:
> >
> > I guess I don't see it as polysemous, but instead simplifying.
> >
> > In my proposal, the terms "leader" and "follower" would have the exact
> same
> > meaning in both SolrCloud and standalone mode. The only difference would
> be
> > that SolrCloud automatically manages the leaders and followers, whereas
> in
> > standalone mode you have to manage them manually (as is the case with
> most
> > things in SolrCloud vs. Standalone).
> >
> > My view is that having an entirely different set of terminology
> describing
> > the same thing is way more cognitive overhead than having consistent
> > terminology.
> >
> > Trey Grainger
> > Founder, Searchkernel
> > https://searchkernel.com
> >
> > On Wed, Jun 17, 2020 at 4:50 PM Walter Underwood 
> > wrote:
> >
> >> I strongly disagree with using the Solr Cloud leader/follower
> terminology
> >> for non-Cloud clusters. People in my company are confused enough without
> >> using polysemous terminology.
> >>
> >> “This node is the leader, but it means something different than the
> leader
> >> in this other cluster.” I’m dreading that conversation.
> >>
> >> I like “principal”. How about “clone” for the slave role? That suggests
> >> that
> >> it does not accept updates and that it is loosely-coupled, only
> depending
> >> on the state of the no-longer-called-master.
> >>
> >> Chegg has five production Solr Cloud clusters and one production
> >> master/slave
> >> cluster, so this is not a hypothetical for us. We have 100+ Solr hosts
> in
> >> production.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/  (my blog)
> >>
> >>> On Jun 17, 2020, at 1:36 PM, Trey Grainger  wrote:
> >>>
> >>> Proposal:
> >>> "A Solr COLLECTION is composed of one or more SHARDS, which each have
> one
> >>> or more REPLICAS. Each replica can have a ROLE of either:
> >>> 1) A LEADER, which can process external updates for the shard
> >>> 2) A FOLLOWER, which receives updates from another replica"
> >>>
> >>> (Note: I prefer "role" but if others think it's too overloaded due to
> the
> >>> overseer role, we could replace it with "mode" or something similar)
> >>> ---
> >>>
> >>> To be explicit with the above definitions:
> >>> 1) In SolrCloud, the roles of leaders and followers can dynamically
> >> change
> >>> based upon the status of the cluster. In standalone mode, they can be
> >>> changed by manual intervention.
> >>> 2) A leader does not have to have any followers (i.e. only one active
> >>> replica)
> >>> 3) Each shard always has one leader.
> >>> 4) A follower can also pull updates from another follower instead of a
> >>> leader (traditionally known as a REPEATER). A repeater is still a
> >> follower,
> >>> but would not be considered a leader beca

Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Trey Grainger
Sorry:
>
> but I maintain that leader vs. follower behavior is inconsistent here.


Sorry, that should have said "I maintain that leader vs. follower behavior
is consistent here."

Trey Grainger
Founder, Searchkernel
https://searchkernel.com

On Wed, Jun 17, 2020 at 6:03 PM Trey Grainger  wrote:

> Hi Walter,
>
> >In Solr Cloud, the leader knows about each follower and updates them.
> Respectfully, I think you're mixing the "TYPE" of replica with the role of
> the "leader" and "follower"
>
> In SolrCloud, only if the TYPE of a follower is NRT or TLOG does the
> leader push updates those followers.
>
> When the TYPE of a follower is PULL, then it does not.  In Standalone
> mode, the type of a (currently) master would be NRT, and the type of the
> (currently) slaves is always PULL.
>
> As such, this behavior is consistent across both SolrCloud and Standalone
> mode. It is true that Standalone mode does not currently have support for
> two of the replica TYPES that SolrCloud mode does, but I maintain that
> leader vs. follower behavior is inconsistent here.
>
> Trey Grainger
> Founder, Searchkernel
> https://searchkernel.com
>
>
>
> On Wed, Jun 17, 2020 at 5:41 PM Walter Underwood 
> wrote:
>
>> But they are not the same. In Solr Cloud, the leader knows about each
>> follower and updates them. In standalone, the master has no idea that
>> slaves exist until a replication request arrives.
>>
>> In Solr Cloud, the leader is elected. In standalone, that role is fixed at
>> config load time.
>>
>> Looking ahead in my email inbox, publisher/subscriber is an excellent
>> choice.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>> > On Jun 17, 2020, at 2:21 PM, Trey Grainger  wrote:
>> >
>> > I guess I don't see it as polysemous, but instead simplifying.
>> >
>> > In my proposal, the terms "leader" and "follower" would have the exact
>> same
>> > meaning in both SolrCloud and standalone mode. The only difference
>> would be
>> > that SolrCloud automatically manages the leaders and followers, whereas
>> in
>> > standalone mode you have to manage them manually (as is the case with
>> most
>> > things in SolrCloud vs. Standalone).
>> >
>> > My view is that having an entirely different set of terminology
>> describing
>> > the same thing is way more cognitive overhead than having consistent
>> > terminology.
>> >
>> > Trey Grainger
>> > Founder, Searchkernel
>> > https://searchkernel.com
>> >
>> > On Wed, Jun 17, 2020 at 4:50 PM Walter Underwood > >
>> > wrote:
>> >
>> >> I strongly disagree with using the Solr Cloud leader/follower
>> terminology
>> >> for non-Cloud clusters. People in my company are confused enough
>> without
>> >> using polysemous terminology.
>> >>
>> >> “This node is the leader, but it means something different than the
>> leader
>> >> in this other cluster.” I’m dreading that conversation.
>> >>
>> >> I like “principal”. How about “clone” for the slave role? That suggests
>> >> that
>> >> it does not accept updates and that it is loosely-coupled, only
>> depending
>> >> on the state of the no-longer-called-master.
>> >>
>> >> Chegg has five production Solr Cloud clusters and one production
>> >> master/slave
>> >> cluster, so this is not a hypothetical for us. We have 100+ Solr hosts
>> in
>> >> production.
>> >>
>> >> wunder
>> >> Walter Underwood
>> >> wun...@wunderwood.org
>> >> http://observer.wunderwood.org/  (my blog)
>> >>
>> >>> On Jun 17, 2020, at 1:36 PM, Trey Grainger 
>> wrote:
>> >>>
>> >>> Proposal:
>> >>> "A Solr COLLECTION is composed of one or more SHARDS, which each have
>> one
>> >>> or more REPLICAS. Each replica can have a ROLE of either:
>> >>> 1) A LEADER, which can process external updates for the shard
>> >>> 2) A FOLLOWER, which receives updates from another replica"
>> >>>
>> >>> (Note: I prefer "role" but if others think it's too overloaded due to
>> the
>> >>> overseer role, we could replace it with "mode" or something similar)
>> >>> ---

Re: Getting rid of Master/Slave nomenclature in Solr

2020-06-17 Thread Trey Grainger
@Shawn,

Ok, yeah, apologies, my semantics were wrong.

I was thinking that a TLog replica is a follower role only and becomes an
NRT replica if it gets elected leader. From a pure semantics standpoint,
though, I guess technically the TLog replica doesn't "become" an NRT
replica, but just "acts the same" as if it was an NRT replica when it gets
elected as leader. From the docs regarding TLog replicas: "This type of
replica maintains a transaction log but does not index document changes
locally... When this type of replica needs to update its index, it does so
by replicating the index from the leader... If it does become a leader, it
will behave the same as if it was a NRT type of replica."

The TLog replicas are a bit of a red herring to the point I was making,
though, which is that Pull Replicas in SolrCloud mode and Slaves in
non-SolrCloud mode both just pull the index from the leader/master, as
opposed to updates being pushed the other way. As such, I don't see a
meaningful distinction between master/slave and leader/follower behavior in
non-SolrCloud mode vs. SolrCloud mode for the specific functionality we're
talking about renaming (Solr cores that pull indices from other Solr cores).
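
To illustrate the parallel, the standalone follower/slave side of that pull
is configured through the ReplicationHandler in solrconfig.xml. A rough
sketch, with the URL and polling interval as placeholders:

  <requestHandler name="/replication" class="solr.ReplicationHandler">
    <lst name="slave">
      <!-- poll the leader/master core's index on a fixed interval -->
      <str name="masterUrl">http://leader-host:8983/solr/core1/replication</str>
      <str name="pollInterval">00:00:60</str>
    </lst>
  </requestHandler>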

At any rate, this is not a hill I care to die on. My belief is that it's
better to have consistent terminology for what I see as essentially the
same functionality. I respect that others disagree and would rather
introduce new terminology to clearly distinguish between modes. Regardless
of the naming decided on, I'm in support of removing the master/slave
nomenclature.

Trey Grainger
Founder, Searchkernel
https://searchkernel.com

On Wed, Jun 17, 2020 at 7:00 PM Shawn Heisey  wrote:

> On 6/17/2020 2:36 PM, Trey Grainger wrote:
> > 2) TLOG - which can only serve in the role of follower
>
> This is inaccurate.  TLOG can become leader.  If that happens, then it
> functions exactly like an NRT leader.
>
> I'm aware that saying the following is bikeshedding ... but I do think
> it would be as mistake to use any existing SolrCloud terminology for
> non-cloud deployments, including the word "replica".  The top contenders
> I have seen to replace master/slave in Solr are primary/secondary and
> publisher/subscriber.
>
> It has been interesting watching this discussion play out on multiple
> open source mailing lists.  On other projects, I have seen a VERY high
> level of resistance to these changes, which I find disturbing and
> surprising.
>
> Thanks,
> Shawn
>


Re: [EXTERNAL] Getting rid of Master/Slave nomenclature in Solr

2020-06-18 Thread Trey Grainger
>
> Let’s instead find a new good name for the cluster type. Standalone kind
> of works
> for me, but I see it can be confused with single-node.

Yeah, I've typically referred to it as "standalone", but I don't think it's
descriptive enough. I can see why some people have been calling it
"master/slave" mode in lieu of a more descriptive alternative. I think a
new name (other than "standalone" or "legacy") would be superb.

We have also discussed replacing SolrCloud (which is a terrible name) with
> something more descriptive.

Today: SolrCloud vs Master/slave
> Alt A: SolrCloud vs Standalone
> Alt B: SolrCloud vs Legacy
> Alt C: Clustered vs Independent
> Alt D: Clustered vs Manual mode


+1. SolrCloud is even less descriptive and IMHO just sounds silly at this
point.

re: "Clustered" vs Independent/Manual. The thing I don't like about that is
that you typically have clusters in both modes. I think the key distinction
is whether Solr "manages" the cluster automatically for you or whether you
manage it manually yourself.

What do you think about:
Alt E: "Managed Clustering" vs. "Unmanaged Clustering" Mode
Alt F:  "Managed Clustering" vs. "Manual Clustering" Mode
?

I think I prefer option F.

Trey Grainger
Founder, Searchkernel
https://searchkernel.com

On Thu, Jun 18, 2020 at 5:59 PM Jan Høydahl  wrote:

> I support Mike Drob and Trey Grainger. We shuold re-use the leader/replica
> terminology from Cloud. Even if you hand-configure a master/slave cluster
> and orchestrate what doc goes to which node/shard, and hand-code your
> shards
> parameter, you will still have a cluster where you’d send updates to the
> leader of
> each shard and the replicas would replicate the index from the leader.
>
> Let’s instead find a new good name for the cluster type. Standalone kind
> of works
> for me, but I see it can be confused with single-node. We have also
> discussed
> replacing SolrCloud (which is a terrible name) with something more
> descriptive.
>
> Today: SolrCloud vs Master/slave
> Alt A: SolrCloud vs Standalone
> Alt B: SolrCloud vs Legacy
> Alt C: Clustered vs Independent
> Alt D: Clustered vs Manual mode
>
> Jan
>
> > 18. jun. 2020 kl. 15:53 skrev Mike Drob :
> >
> > I personally think that using Solr cloud terminology for this would be
> fine
> > with leader/follower. The leader is the one that accepts updates,
> followers
> > cascade the updates somehow. The presence of ZK or election doesn’t
> really
> > change this detail.
> >
> > However, if folks feel that it’s confusing, then I can’t tell them that
> > they’re not confused. Especially when they’re working with others who
> have
> > less Solr experience than we do and are less familiar with the
> intricacies.
> >
> > Primary/Replica seems acceptable. Coordinator instead of Overseer seems
> > acceptable.
> >
> > Would love to see this in 9.0!
> >
> > Mike
> >
> > On Thu, Jun 18, 2020 at 8:25 AM John Gallagher
> >  wrote:
> >
> >> While on the topic of renaming roles, I'd like to propose finding a
> better
> >> term than "overseer" which has historical slavery connotations as well.
> >> Director, perhaps?
> >>
> >>
> >> John Gallagher
> >>
> >> On Thu, Jun 18, 2020 at 8:48 AM Jason Gerlowski 
> >> wrote:
> >>
> >>> +1 to rename master/slave, and +1 to choosing terminology distinct
> >>> from what's used for SolrCloud.  I could be happy with several of the
> >>> proposed options.  Since a good few have been proposed though, maybe
> >>> an eventual vote thread is the most organized way to aggregate the
> >>> opinions here.
> >>>
> >>> I'm less positive about the prospect of changing the name of our
> >>> primary git branch.  Most projects that contributors might come from,
> >>> most tutorials out there to learn git, most tools built on top of git
> >>> - the majority are going to assume "master" as the main branch.  I
> >>> appreciate the change that Github is trying to effect in changing the
> >>> default for new projects, but it'll be a long time before that
> >>> competes with the huge bulk of projects, documentation, etc. out there
> >>> using "master".  Our contributors are smart and I'm sure they'd figure
> >>> it out if we used "main" or something else instead, but having a
> >>> non-standard git setup would be one more "papercut" in

[PSA] Activate 2019 Call for Speakers ends May 8

2019-05-04 Thread Trey Grainger
Hi everyone,

I wanted to do a quick PSA for anyone who may have missed the announcement
last month to let you know the call for speakers is currently open
through *Wednesday,
May 8th*, for Activate 2019 (the Search and AI Conference), focused on the
Apache Solr ecosystem and the intersection of Search and AI:
https://lucidworks.com/2019/04/02/activate-2019-call-for-speakers/

The Activate Conference will be held September 9-12 in Washington, D.C.

The conference, rebranded last year from "Lucene/Solr Revolution", is
expected to grow considerably this year, and I'd like to encourage all of
you working on advancements in the Lucene/Solr project or working on
solving interesting problems in this space to consider submitting a talk if
you haven't already. There are tracks dedicated to Solr Development,
AI-powered Search, Search Development at Scale, and numerous other related
topics - including tracks for key use cases like digital commerce - that I
expect most on this list will find appealing.

If you're interested in presenting (your conference registration fee will
be covered if accepted), please submit a talk here:
https://activate-conf.com/speakers/

Just wanted to make sure everyone in the development and user community
here was aware of the conference and didn't miss the opportunity to submit
a talk by Wednesday if interested.

All the best,

Trey Grainger
Chief Algorithms Officer @ Lucidworks
https://www.linkedin.com/in/treygrainger/