Re: Application of different stemmers / stopword lists within a single field

2014-04-25 Thread Otis Gospodnetic
Hi Tim, Step one is probably to detect language boundaries. You know your data. If they happen on paragraph breaks, your job will be easier. If they don't, a bit harder, but not impossible at all. I'm sure there is a ton of research on this topic out there, but the obvious approach would invol
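
As a rough illustration of the paragraph-break case mentioned above, here is a minimal sketch that tags each paragraph by its dominant script. It rests on several assumptions that are not from the thread: paragraphs are separated by blank lines, only Arabic and Latin scripts matter, and script detection is a good enough stand-in for language detection. The function names are hypothetical.

```python
import unicodedata


def dominant_script(paragraph):
    """Return 'arabic', 'latin', or 'unknown' for one paragraph.

    Assumption: counting letters by Unicode character name is an
    acceptable proxy for the paragraph's language.
    """
    arabic = 0
    latin = 0
    for ch in paragraph:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, whitespace
        name = unicodedata.name(ch, "")
        if name.startswith("ARABIC"):
            arabic += 1
        elif name.startswith("LATIN"):
            latin += 1
    if arabic == 0 and latin == 0:
        return "unknown"
    return "arabic" if arabic > latin else "latin"


def split_by_script(text):
    """Split text on blank lines and tag each non-empty paragraph."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return [(dominant_script(p), p) for p in paragraphs]
```

Each tagged paragraph could then be routed to a language-specific field with its own stemmer and stopword list. Boundaries that fall inside a paragraph would need a finer-grained approach than this sketch offers.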

Re: Search for a mask that matches the requested string

2014-04-25 Thread Otis Gospodnetic
Luwak is not based on a fork of Lucene; or rather, the fork you are seeing is there only because the Luwak authors needed highlighting. If you don't need highlighting, you can probably modify Luwak a bit to use regular Lucene. The Lucene fork you are seeing there will also, eventually, be committ

Re: Not allowing exact match with WordDelimiterFilterFactory

2014-04-25 Thread Kashish
Hi Jack, The autoGeneratePhraseQueries="true" for the text field type will make it treat the input as a phrase every time, right? I want it treated as a phrase only when given within double quotes and otherwise not (i.e. if the input is fast-five, search for fast, five, fast five, etc.). For now I separated the

Re: Search for a mask that matches the requested string

2014-04-25 Thread Muhammad Gelbana
Luwak is based on a fork of Solr/Lucene which I cannot use. I have to do this using Solr 4.6, whether by writing extra code or not. Thanks. *-* *Muhammad Gelbana* http://www.linkedin.com/in/mgelbana On Sat, Apr 26, 2014 at 12:13 AM, Ahmet Arslan wrote: > Hi, > > You don't n

Re: Search for a mask that matches the requested string

2014-04-25 Thread Ahmet Arslan
Hi, You don't need to write code for this. Use luwak (I gave the link in my first e-mail) instead. If you can't get luwak running because it's too complicated etc., see a similar discussion http://find.searchhub.org/document/9411388c7d2de701#36e50082e918b10c where diy-percolator example point

Re: Search for a mask that matches the requested string

2014-04-25 Thread Muhammad Gelbana
@Jack, I am ready to write custom code to implement such a feature, but I don't know which feature in Solr I should extend. Where should I start? I believe it should be a very simple task. @Ahmet, how can I use the class you mentioned? Is there a tutorial for it? I'm not sure how the code in the c

Re: TB scale

2014-04-25 Thread Shawn Heisey
On 4/25/2014 1:48 PM, Ed Smiley wrote: > Anyone with experience, suggestions or lessons learned in the 10 -100 TB > scale they'd like to share? > Researching optimum design for a Solr Cloud with, say, about 20TB index. You've gotten some good information already in the replies that have come your

Re: Solr data directory contains index backups

2014-04-25 Thread solr2020
Thanks Greg. Is there any Solr configuration to do this periodically if any unused index copy or snapshot exists in the data directory? -- View this message in context: http://lucene.472066.n3.nabble.com/Solr-data-directory-contains-index-backups-tp4132590p4133221.html Sent from the Solr - User ma

Re: TB scale

2014-04-25 Thread Yonik Seeley
How many documents? That can be just as important (often more important) than total index size. Some other details, like the types of requests, would be helpful (i.e. what the index will be used for... the latency requirements of requests, if you will be faceting, etc). -Yonik http://heliosearch.

Re: TB scale

2014-04-25 Thread Ed Smiley
Not looking for a cookbook. Just curious to hear some war stories since this is relatively rare. -Ed :) -- Ed Smiley, Senior Software Architect, Ebooks ProQuest | 161 Evelyn Ave. | Mountain View, CA 94041 USA | +1 640 475 8700 ext. 3772 ed.smi...@proquest.com www.proquest.com

Re: SolrCloud load balancing during heavy indexing

2014-04-25 Thread Otis Gospodnetic
Hi, On Fri, Apr 25, 2014 at 12:54 PM, zzT wrote: > Erick Erickson wrote > > Back up, you're misunderstanding the update process. A leader node > > distributes the update to every replica. So _all_ your nodes in a > > slice are indexing when _any_ of them index. So the idea of sending > > queries

Re: TB scale

2014-04-25 Thread Otis Gospodnetic
Hi Ed, Unfortunately, there is no good *general* advice, so you'd need to provide a lot more detail to get useful help. Otis -- Performance Monitoring * Log Analytics * Search Analytics Solr & Elasticsearch Support * http://sematext.com/ On Fri, Apr 25, 2014 at 3:48 PM, Ed Smiley wrote: > Any

Re: Not allowing exact match with WordDelimiterFilterFactory

2014-04-25 Thread Jack Krupansky
Generally, if you are using the word delimiter filter, you need to have separate index and query analyzers and set preserveOriginal="true" for the index analyzer, but set preserveOriginal="false" for the query analyzer. Also, set autoGeneratePhraseQueries="true" for your text field type. -- J

Re: TB scale

2014-04-25 Thread Jack Krupansky
Also take a look at using DataStax Enterprise for managing large distributed databases, using Cassandra for the system of record data storage and Solr for indexing and search. See: http://www.datastax.com/what-we-offer/products-services/datastax-enterprise How many documents is your 20TB? --

Re: Search for a mask that matches the requested string

2014-04-25 Thread Ahmet Arslan
Hi, Your use case is different from ad hoc retrieval, where you have a set of documents and varying queries. In your case it is the reverse: you have a query (string masks) stored A?, and incoming documents are percolated against it. Out of the box, Solr does not have support for this t
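
To make the "reverse" direction concrete, here is a minimal sketch that runs outside Solr: a set of stored masks is checked against one incoming string. The mask syntax is an assumption, not something the thread specifies: `?` is taken to mean exactly one character and `*` any run of characters, with everything else literal. The function names are hypothetical, and this is neither Luwak's nor Elasticsearch's implementation.

```python
import re


def mask_to_regex(mask):
    """Translate a stored mask into a compiled regular expression.

    Assumed syntax: '?' matches exactly one character, '*' matches
    any run of characters, all other characters match literally.
    """
    parts = []
    for ch in mask:
        if ch == "?":
            parts.append(".")
        elif ch == "*":
            parts.append(".*")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def matching_masks(masks, text):
    """Return the stored masks that match the whole incoming string."""
    return [m for m in masks if mask_to_regex(m).fullmatch(text)]
```

For example, `matching_masks(["A?", "B*", "A??"], "AB")` keeps only `"A?"`. A linear scan like this is only reasonable for a small number of masks; a percolator-style index exists precisely to avoid testing every stored query.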

Re: Search for a mask that matches the requested string

2014-04-25 Thread Jack Krupansky
No, neither Lucene nor Solr provides a "mask match" feature. You could write custom code to emulate such a feature. Elasticsearch appears to have done that with its "percolate" feature: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-percolate.html -- Jack Krupansky

Re: Search for a mask that matches the requested string

2014-04-25 Thread Muhammad Gelbana
I have no idea how this can help me. I have been using Solr for a few weeks and I'm not familiar with it yet. I'm asking for a very simple task: a way to customize how Solr matches a string. Does this exist in Solr? *-* *Muhammad Gelbana* http://www.linkedin.com/in/mgelbana

TB scale

2014-04-25 Thread Ed Smiley
Anyone with experience, suggestions or lessons learned in the 10 -100 TB scale they'd like to share? Researching optimum design for a Solr Cloud with, say, about 20TB index. - Thanks Ed Smiley, Senior Software Architect, Ebooks ProQuest | 161 Evelyn Ave. | Mountain View, CA 94041 USA | +1 640 475

Not allowing exact match with WordDelimiterFilterFactory

2014-04-25 Thread Kashish
Hi, I am having a problem with WordDelimiterFilterFactory. This is my fieldType So now, if I search for a word like fast-five across this field, the debug shows me (((titleName:fast-five)^20 OR (akaName:fast-five)^10)) (((titl

Core "creation" and "reload"

2014-04-25 Thread Prashant Golash
Hi, I was wondering about the difference between the two core actions "creation" and "reload" and have some doubts I hope to clarify. a) *Reload core* - Happens when we explicitly reload the core. If reloading fails (due to a malformed config like elevate.xml etc.), then Solr keeps on serving requests from ol

Re: DIH issues with 4.7.1

2014-04-25 Thread Shawn Heisey
On 4/25/2014 11:56 AM, Hutchins, Jonathan wrote: I recently upgraded from 4.6.1 to 4.7.1 and have found that the DIH process that we are using takes 4x as long to complete. The only odd thing I notice is when I enable debug logging for the dataimporthandler process, it appears that in the new ve

Re: DIH issues with 4.7.1

2014-04-25 Thread Alan Woodward
Hi Jonathan, It's a known bug: https://issues.apache.org/jira/browse/SOLR-5954. It'll be fixed in 4.8, which is being voted on now. Alan Woodward www.flax.co.uk On 25 Apr 2014, at 18:56, Hutchins, Jonathan wrote: > I recently upgraded from 4.6.1 to 4.7.1 and have found that the DIH > process

Re: SolrCloud load balancing during heavy indexing

2014-04-25 Thread Erick Erickson
What Shawn said Erick On Fri, Apr 25, 2014 at 10:13 AM, Shawn Heisey wrote: > On 4/25/2014 10:54 AM, zzT wrote: >> >> So, this is where SolrCloud is different from legacy master/slave >> configuration? I mean master/slave sends segments to the slaves using e.g. >> rsync while SolrCloud forwa

Re: Solr Cluster management having too many cores

2014-04-25 Thread Erick Erickson
So you're talking about 700 or so collections. That should be do-able, especially as Solr is rapidly evolving to handle more and more collections and there's two years for that to happen. The aging out bit is manual (well, you'd script it I suppose). So every day there'd be a script that ran and "
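
A daily aging script along these lines might be sketched as follows. It only computes which daily collections the alias should cover and builds a Collections API CREATEALIAS request URL; it does not call Solr or delete anything. The naming scheme (`sample_collection_YYYY_MM_DD`), the retention window, and the base URL are assumptions for illustration, not details from the thread.

```python
from datetime import date, timedelta
from urllib.parse import urlencode


def daily_collections(today, keep_days, prefix="sample_collection"):
    """Names of the collections to keep in the alias, newest first.

    Assumption: one collection per day, named prefix_YYYY_MM_DD.
    """
    return [
        f"{prefix}_{(today - timedelta(days=i)):%Y_%m_%d}"
        for i in range(keep_days)
    ]


def createalias_url(base_url, alias, collections):
    """Build a CREATEALIAS URL that points the alias at the collections."""
    query = urlencode(
        {
            "action": "CREATEALIAS",
            "name": alias,
            "collections": ",".join(collections),
        },
        safe=",",  # keep the comma-separated list readable
    )
    return f"{base_url}/admin/collections?{query}"
```

Re-issuing CREATEALIAS with the same alias name replaces the alias definition, so running this once a day moves the window forward. Collections that fall out of the window would still need to be deleted or archived in a separate step.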

How to optimize a DisMax of multiple cachable queries?

2014-04-25 Thread Gregg Donovan
Some of our site's categories are actually search driven. They're created manually by crafting a Solr query out of a list of Lucene queries that are joined in a DisjunctionMaxQuery. There are often 100+ disjuncts in this query. This works nicely, but is much slower than it could be because many part

DIH issues with 4.7.1

2014-04-25 Thread Hutchins, Jonathan
I recently upgraded from 4.6.1 to 4.7.1 and have found that the DIH process that we are using takes 4x as long to complete. The only odd thing I notice is when I enable debug logging for the dataimporthandler process, it appears that in the new version each sql query is resulting in a new connecti

Re: SolrCloud load balancing during heavy indexing

2014-04-25 Thread Shawn Heisey
On 4/25/2014 10:54 AM, zzT wrote: So, this is where SolrCloud is different from legacy master/slave configuration? I mean master/slave sends segments to the slaves using e.g. rsync while SolrCloud forwards the indexing request to replicas where it's processed "locally" on each replica, right? T

Re: SolrCloud load balancing during heavy indexing

2014-04-25 Thread zzT
Erick Erickson wrote > Back up, you're misunderstanding the update process. A leader node > distributes the update to every replica. So _all_ your nodes in a > slice are indexing when _any_ of them index. So the idea of sending > queries to just the replicas to avoid performance problems isn't > re

Re: Solr Cluster management having too many cores

2014-04-25 Thread Mukesh Jha
Thanks for the quick reply, Erick. I want to keep my collections till I run out of hardware, which is at least a couple of years' worth of data. I'd like to know more on ageing out aliases; I did a quick search but didn't find much. On Fri, Apr 25, 2014 at 9:45 PM, Erick Erickson wrote: > Hmmm, tell us a l

Re: Adding replicas to an existing SolrCLoud collection

2014-04-25 Thread Erick Erickson
There are a couple of options here. The collections ADDREPLICA lets you specify the node you want the replica added to. You can specify the shard, collection, and node. Have you tried just adding replicas to the collection? I _think_ that it'll try to put new nodes on under-utilized machines, but

Re: Application of different stemmers / stopword lists within a single field

2014-04-25 Thread Erick Erickson
Solr doesn't have such capabilities built in that I know of. There are various language-recognition tools out there that you could potentially fire the extracted text blocks at and get something back, but extracting the text blocks would be a custom step on your part... Hmmm, if you can solve the

Re: Solr Cluster management having too many cores

2014-04-25 Thread Erick Erickson
Hmmm, tell us a little more about your use-case. In particular, how long do you need to keep the data around? Days? Months? Years? Because if you only need to keep the data for a specified period, you can use the collection aliasing process to age-out collections and keep the number of cores from

Re: SolrCloud load balancing during heavy indexing

2014-04-25 Thread Erick Erickson
Back up, you're misunderstanding the update process. A leader node distributes the update to every replica. So _all_ your nodes in a slice are indexing when _any_ of them index. So the idea of sending queries to just the replicas to avoid performance problems isn't relevant. In order to support NR

Re: dynamic field assignments

2014-04-25 Thread Jack Krupansky
Solr only supports mapping of values to field names, not mapping to field types. Field names are then mapped to field types. DynamicField only supports prefix OR suffix wildcard, not both in the same pattern. In the future, please take care to design your data model with the features and lim
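
The prefix-or-suffix restriction can be illustrated with a small sketch that runs outside Solr. The first function mimics the matching rule as described above; the second shows one possible client-side workaround, renaming fields before indexing so that a type marker in the middle of the name becomes a suffix. The workaround, the `_TEXT_` marker, and both function names are assumptions for illustration, not a Solr feature.

```python
def matches_dynamic_field(pattern, name):
    """Check a field name against a pattern with one leading or trailing '*'.

    A pattern such as '*_TEXT_*' with wildcards at both ends is
    rejected, mirroring the limitation described above.
    """
    if len(pattern) > 1 and pattern.startswith("*") and pattern.endswith("*"):
        raise ValueError("wildcard allowed at the start OR the end, not both")
    if pattern.startswith("*"):
        return name.endswith(pattern[1:])
    if pattern.endswith("*"):
        return name.startswith(pattern[:-1])
    return name == pattern


def move_marker_to_suffix(name, marker="_TEXT_"):
    """Rewrite e.g. FOO_BAR_TEXT_1 as FOO_BAR_1_TEXT (hypothetical scheme)."""
    head, sep, tail = name.partition(marker)
    if not sep:
        return name  # marker absent: leave the name unchanged
    return f"{head}_{tail}{marker.rstrip('_')}"
```

With this renaming, `FOO_BAR_TEXT_1` becomes `FOO_BAR_1_TEXT`, which a suffix pattern such as `*_TEXT` can match. The renaming would have to be applied consistently at both index and query time.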

Re: Adding replicas to an existing SolrCLoud collection

2014-04-25 Thread Ugo Matrangolo
Yeah, I could do that, but I was hoping for something less hacky :p On Fri, Apr 25, 2014 at 3:08 PM, Shawn Heisey wrote: > On 4/25/2014 3:34 AM, Ugo Matrangolo wrote: > > I have a running Solr 4.7.1 collection with a single shard replicated > over > > 7 nodes. This collection has been created usin

Re: Adding replicas to an existing SolrCLoud collection

2014-04-25 Thread Shawn Heisey
On 4/25/2014 3:34 AM, Ugo Matrangolo wrote: > I have a running Solr 4.7.1 collection with a single shard replicated over > 7 nodes. This collection has been created using a replicationFactor=7. The > idea was to replicate it on all the available nodes (it is a high > throughput collection). > > Re

Solr Cluster management having too many cores

2014-04-25 Thread Mukesh Jha
Hi Experts, I need to divide my indexes based on hour/day with each index having ~50-80 GB data & ~50-80 mill docs, so I'm planning to create daily collection with names e.g. *sample_colledction__mm_dd_hh.* I'll also create an alias *sample_collection* and update it whenever I will create a ne

Re: dynamic field assignments

2014-04-25 Thread John Thorhauer
Jack, Thanks for your help. > Reading your last paragraph, how is that any different than exactly what > DynamicField actually does? My understanding is that DynamicField can do something like FOO_BAR_TEXT_* but what I really need is *_TEXT_* as I might have FOO_BAR_TEXT_1 but I also might have

Re: dynamic field assignments

2014-04-25 Thread Jack Krupansky
Reading your last paragraph, how is that any different than exactly what DynamicField actually does? You say you want to change fields at "run time" - what is "run time"? When exactly do your field names change? To be clear, field names do not change in Solr once the data is written to the ind

dynamic field assignments

2014-04-25 Thread John Thorhauer
I have a scenario where I would like to dynamically assign incoming document fields to two different Solr schema fieldTypes. One fieldType will be an exact match fieldType while the other will be a full text fieldType. I know that I can use the dynamicField to assign fields using the asterisk in a n

Application of different stemmers / stopword lists within a single field

2014-04-25 Thread Timothy Hill
This may not be a practically solvable problem, but the company I work for has a large number of lengthy mixed-language documents - for example, scholarly articles about Islam written in English but containing lengthy passages of Arabic. Ideally, we would like users to be able to search both the En

SolrCloud load balancing during heavy indexing

2014-04-25 Thread zzT
Hi all, In SolrCloud all nodes are equal in the sense that they can perform indexing as well as searching. Let's say a leader node is busy performing heavy-indexing, I wouldn't like to also send search requests to that node. As far as I can tell from CloudSolrServer source code, all it does when

Re: StopFilter:enablePositionIncrements question

2014-04-25 Thread ku3ia
Hi, Ahmet! Thanks for your reply. I understand, that it is ok. And one more question, based on https://issues.apache.org/jira/browse/LUCENE-4963 >>We have some TokenFilters which are only broken with specific options. This includes: >>StopFilter, ..., LengthFilter when enablePositionIncrements=fals

Re: SpanQuery with Boolean Queries

2014-04-25 Thread Ahmet Arslan
Hi, I am not sure how OR clauses are executed. But after re-reading your mail, I think you can use SpanOrQuery (for your q1) in your custom query parser plugin. val q2 = new SpanOrQuery( new SpanTermQuery(new Term("BookingRecordId", "ID_1")), new

Adding replicas to an existing SolrCLoud collection

2014-04-25 Thread Ugo Matrangolo
Hi, I have a running Solr 4.7.1 collection with a single shard replicated over 7 nodes. This collection has been created using a replicationFactor=7. The idea was to replicate it on all the available nodes (it is a high throughput collection). Recently I have added more nodes to house a different