Re: Facet on TrieDateField field without including date

2012-02-15 Thread Ted Dunning
Use multiple fields and you get what you want. The extra fields are going to cost very little and will have a big positive impact. On Wed, Feb 15, 2012 at 9:30 AM, Jamie Johnson wrote: > I think it would if I indexed the time information separately. Which > was my original thought, but I was h

Re: Need help with graphing function (MATH)

2012-02-14 Thread Ted Dunning
In general this kind of function is very easy to construct using sums of basic sigmoidal functions. The logistic and probit functions are commonly used for this. Sent from my iPhone On Feb 14, 2012, at 10:05, Mark wrote: > Thanks I'll have a look at this. I should have mentioned that the act
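
A minimal sketch of the "sum of sigmoids" idea, assuming logistic steps. The step heights, centers, and widths in `BUMP` are made up for illustration and have nothing to do with the shape Mark actually asked about.

```python
import math

def logistic(x, center=0.0, width=1.0):
    """Smooth step from 0 to 1 centred at `center`; `width` sets how gradual it is."""
    return 1.0 / (1.0 + math.exp(-(x - center) / width))

def sum_of_sigmoids(x, steps):
    """Evaluate a curve built from (height, center, width) logistic steps."""
    return sum(height * logistic(x, center, width) for height, center, width in steps)

# Hypothetical shape: rise by 1.0 around x=2, then fall by 1.0 around x=8,
# which gives a smooth bump that is near 1 in the middle and near 0 outside.
BUMP = [(1.0, 2.0, 0.5), (-1.0, 8.0, 0.5)]

if __name__ == "__main__":
    for x in (-5, 2, 5, 8, 15):
        print(x, round(sum_of_sigmoids(x, BUMP), 4))
```

Adding more steps, or steps with different heights, gives plateaus and ramps of any shape you like. A probit step could be swapped in for `logistic` without changing the rest.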

Re: is there any practice to load index into RAM to accelerate solr performance?

2012-02-08 Thread Ted Dunning
Add this as well: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.155.5030 On Wed, Feb 8, 2012 at 1:56 AM, Andrzej Bialecki wrote: > On 08/02/2012 09:17, Ted Dunning wrote: > >> This is true with Lucene as it stands. It would be much faster if there >> were a spe

Re: is there any practice to load index into RAM to accelerate solr performance?

2012-02-08 Thread Ted Dunning
This is true with Lucene as it stands. It would be much faster if there were a specialized in-memory index such as is typically used with high performance search engines. On Tue, Feb 7, 2012 at 9:50 PM, Lance Norskog wrote: > Experience has shown that it is much faster to run Solr with a small

Re: Tika0.10 language identifier in Solr3.5.0

2012-01-23 Thread Ted Dunning
omments > stand. The main point is that mixing all the *results* of the > analysis chains for multiple languages into a single field > will likely result in "interesting" behavior. Not to say it won't > be satisfactory in your situation, but there are edge cases. > >

Re: Tika0.10 language identifier in Solr3.5.0

2012-01-20 Thread Ted Dunning
Høydahl, search solution architect > Cominvent AS - www.cominvent.com > Solr Training - www.solrtraining.com > > On 20. jan. 2012, at 18:15, Ted Dunning wrote: > > > I think you misunderstood what I am suggesting. > > > > I am suggesting an analyzer that detects the languag

Re: Tika0.10 language identifier in Solr3.5.0

2012-01-20 Thread Ted Dunning
I think you misunderstood what I am suggesting. I am suggesting an analyzer that detects the language and then "does the right thing" according to the language it finds. As such, it would tokenize and stem English according to English rules, German by German rules and would probably do a sliding

Re: Tika0.10 language identifier in Solr3.5.0

2012-01-20 Thread Ted Dunning
> http://sematext.com/spm/solr-performance-monitoring/index.html > > > - Original Message - > > From: nibing > > To: solr-user@lucene.apache.org > > Cc: > > Sent: Friday, January 20, 2012 1:51 AM > > Subject: RE: Tika0.10 language identifier in Solr3.5

Re: Tika0.10 language identifier in Solr3.5.0

2012-01-20 Thread Ted Dunning
Write a tokenizer that does language ID and then picks which tokenizer to use. Then record the language in the language id field. What is there to elaborate? On Fri, Jan 20, 2012 at 1:58 AM, nibing wrote: > But then there occurs a problem of using analyzer in indexing. I assume > files encoded
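
A toy sketch of the shape of that suggestion, in plain Python rather than as a Lucene or Solr analyzer. The stopword-based detector and the per-language tokenizers are hypothetical stand-ins; a real setup would plug in a proper language identifier and the real analysis chains.

```python
import re

# Hypothetical stopword hints standing in for a real language identifier.
HINTS = {
    "en": {"the", "and", "of", "is"},
    "de": {"der", "die", "und", "ist"},
}

def detect_language(text):
    """Return the language whose hint words occur most often, or 'unknown'."""
    words = set(re.findall(r"\w+", text.lower()))
    scores = {lang: len(words & hints) for lang, hints in HINTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"

def tokenize_en(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def tokenize_de(text):
    return re.findall(r"[a-zäöüß0-9]+", text.lower())

TOKENIZERS = {"en": tokenize_en, "de": tokenize_de}

def analyze(text):
    """Detect the language, pick the matching tokenizer, and return both results.

    The language goes into the language id field; the tokens go into the text field.
    """
    language = detect_language(text)
    tokenizer = TOKENIZERS.get(language, str.split)  # fallback: whitespace split
    return language, tokenizer(text)
```

The point is only that detection and dispatch happen in one place, so the language recorded for the document always matches the tokenizer that was applied.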

Re: How to accelerate your Solr-Lucene appication by 4x

2012-01-20 Thread Ted Dunning
ent is that adding 1 SSD to each server and using it as a > cache (more precisely as cache expansion to the cache already in RAM) will > give you the best price/performance benefit of all options you have. > > > > Does this clarify things? Was I able to answer your question? >

Re: How to accelerate your Solr-Lucene appication by 4x

2012-01-19 Thread Ted Dunning
Actually, for search applications there is a reasonable amount of evidence that holding the index in RAM is more cost-effective than using SSDs because the throughput is enough faster to make up for the price differential. There are several papers out of UMass that describe this trade-off, alt

Re: Tika0.10 language identifier in Solr3.5.0

2012-01-19 Thread Ted Dunning
Normally this is done by putting a field on each document rather than separating the documents into separate corpora. Keeping them together makes the final search faster. At query time, you can add all of the language keys that you think are relevant based on your language id applied to the query

Re: using solr for time series data

2012-01-19 Thread Ted Dunning
Take a look at openTSDB. You might want to use that as is, or steal some of the concepts. The major idea to snitch is using a single row of the database (document in Lucene or Solr) to hold many data points. Thus, you could consider having documents with the following fields: key:

Re: How to accelerate your Solr-Lucene appication by 4x

2012-01-19 Thread Ted Dunning
Peter, My guess is that if you had said something along the lines of "We have developed some SSD support software that makes SOLR work better. I would like to open a conversation here (link to external discussion)" that would have been reasonably well received. One of the things that makes SPAM

Re: How to accelerate your Solr-Lucene appication by 4x

2012-01-18 Thread Ted Dunning
On Thu, Jan 19, 2012 at 1:40 AM, Darren Govoni wrote: > And to be honest, many people on this list are professionals who not only > build their own solutions, but also buy tools and tech. > > I don't see what the big deal is if some clever company has something of > imminent value here to share i

Re: Relevancy and random sorting

2012-01-11 Thread Ted Dunning
I think the OP meant to use random order in the case of score ties. On Wed, Jan 11, 2012 at 9:31 PM, Erick Erickson wrote: > Alexandre: > > Have you thought about grouping? If you can analyze the incoming > documents and include a field such that "similar" documents map > to the same value, than
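
For what it is worth, random order among score ties is easy to sketch outside of Solr. This is plain Python over an already retrieved result list, not a Solr sort specification; the `(doc_id, score)` tuples are an assumed representation.

```python
import random

def sort_with_random_ties(hits, rng=random):
    """Sort (doc_id, score) pairs by descending score; equal scores come out in random order.

    `rng` can be the `random` module or a seeded random.Random instance.
    """
    # The second key component is a fresh random number per hit, so it only
    # matters when two hits have exactly the same score.
    return sorted(hits, key=lambda hit: (-hit[1], rng.random()))
```

Passing a seeded `random.Random` makes the order repeatable, which is handy if the same user should see a stable ordering across pages.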

Re: Stemming numbers

2012-01-10 Thread Ted Dunning
less than trying to engineer all possible rewrites by hand. On Tue, Jan 10, 2012 at 10:21 PM, Tanner Postert wrote: > You mention "that is one way to do it" is there another i'm not seeing? > > On Jan 10, 2012, at 4:34 PM, Ted Dunning wrote: > > > On Tue, J

Re: Stemming numbers

2012-01-10 Thread Ted Dunning
On Tue, Jan 10, 2012 at 5:32 PM, Tanner Postert wrote: > We've had some issues with people searching for a document with the > search term '200 movies'. The document is actually title 'two hundred > movies'. > > Do we need to add every number to our synonyms dictionary to > accomplish this? Tha

Re: stopwords as privacy measure

2012-01-08 Thread Ted Dunning
On Sun, Jan 8, 2012 at 3:33 PM, Michael Lissner < mliss...@michaeljaylissner.com> wrote: > I have a unique use case where I have words in my corpus that users > shouldn't ever be allowed to search for. My theory is that if I add these > to the stopwords list, that should do the trick. > That shou

Re: complex keywords, hierarchical data, Solr representation problem

2012-01-08 Thread Ted Dunning
Option 3 is preferable because you can use phrase queries to get interesting results as in "color light beige" or "color light". Normalizing is bad in this kind of environment. On Sun, Jan 8, 2012 at 11:35 AM, jimmy wrote: > ... > First Table KEYWORDS: > keyword_id, keyword > 1, white horse > 2

Re: Solr Distributed Search vs Hadoop

2011-12-28 Thread Ted Dunning
This copying is a bit overstated here because of the way that small segments are merged into larger segments. Those larger segments are then copied much less often than the smaller ones. While you can wind up with lots of copying in certain extreme cases, it is quite rare. In particular, if you

Re: Hardware resource indication

2011-12-22 Thread Ted Dunning
On Thu, Dec 22, 2011 at 7:02 AM, Zoran | Bax-shop.nl < zoran.bi...@bax-shop.nl> wrote: > Hello, > > What are (ballpark figure) the hardware requirement (diskspace, memory) > SOLR will use i this case: > > > * Heavy Dutch traffic webshop, 30.000 - 50.000 visitors a day > Unique users doesn

Re: Solr Distributed Search vs Hadoop

2011-12-20 Thread Ted Dunning
illions of users, > and a rough estimation for each user's data would be something around > 5 MB. > > The other problem is that those data will be changed very often. > > I hope I answered your question. > > Thanks > > On Tue, Dec 20, 2011 at 4:00 PM, Ted Dunning

Re: Solr Distributed Search vs Hadoop

2011-12-20 Thread Ted Dunning
You didn't mention how big your data is or how you create it. Hadoop would mostly be used in the preparation of the data or the off-line creation of indexes. On Tue, Dec 20, 2011 at 12:28 PM, Alireza Salimi wrote: > Hi, > > I have a basic question, let's say we're going to have a very very huge set

Re: Core overhead

2011-12-16 Thread Ted Dunning
We still disagree. On Fri, Dec 16, 2011 at 12:29 PM, Jason Rutherglen < jason.rutherg...@gmail.com> wrote: > Ted, > > The list would be unreadable if everyone spammed at the bottom their > email like Otis'. It's just bad form. > > Jason > > On Fri, Dec 16

Re: Core overhead

2011-12-16 Thread Ted Dunning
Sounds like we disagree. On Fri, Dec 16, 2011 at 11:56 AM, Jason Rutherglen < jason.rutherg...@gmail.com> wrote: > Ted, > > "...- FREE!" is stupid idiot spam. It's annoying and not suitable. > > On Fri, Dec 16, 2011 at 11:45 AM, Ted Dunning > wrote: >

Re: Core overhead

2011-12-16 Thread Ted Dunning
I thought it was slightly clumsy, but it was informative. It seemed like a fine thing to say. Effectively it was "I/we have developed a tool that will help you solve your problem". That is responsive to the OP and it is clear that it is a commercial deal. On Fri, Dec 16, 2011 at 10:02 AM, Jason

Re: Core overhead

2011-12-15 Thread Ted Dunning
Here is a talk I did on this topic at HPTS a few years ago. On Thu, Dec 15, 2011 at 4:28 PM, Robert Petersen wrote: > I see there is a lot of discussions about "micro-sharding", I'll have to > read them. I'm on an older version of solr and just use master index > replicating out to a farm of sl

Re: Micro-Sharding

2011-12-05 Thread Ted Dunning
On Mon, Dec 5, 2011 at 3:28 PM, Shawn Heisey wrote: > On 12/4/2011 12:41 AM, Ted Dunning wrote: > >> Read the papers I referred to. They describe how to search fairly >> enormous >> corpus with an 8GB in-memory index (and no disk cache at all). >> > > The

Re: SolR for time-series data

2011-12-04 Thread Ted Dunning
Sax is attractive, but I have found it lacking in practice. My primary issue is that in order to get sufficient recall for practical matching problems, I had to do enough query expansion that the speed advantage of inverted indexes went away. The OP was asking for blob storage, however, and I thi

Re: Micro-Sharding

2011-12-03 Thread Ted Dunning
On Sat, Dec 3, 2011 at 6:36 PM, Shawn Heisey wrote: > On 12/3/2011 2:25 PM, Ted Dunning wrote: > >> Things have changed since I last did this sort of thing seriously. My >> guess is that this is a relatively small amount of memory to devote to >> search. It used to be

Re: Micro-Sharding

2011-12-03 Thread Ted Dunning
On Sat, Dec 3, 2011 at 10:54 AM, Shawn Heisey wrote: > In another thread, something was said that sparked my interest: > > On 12/1/2011 7:17 PM, Ted Dunning wrote: > >> Of course, resharding is almost never necessary if you use micro-shards. >> Micro-shards are shards s

Re: Configuring the Distributed

2011-12-01 Thread Ted Dunning
> event if you plan carefully, and I think many will be able to handle the > cost of splitting (you might even mark the replica you are splitting on so > that it's not part of queries while its 'busy' splitting). > > > > - Mark > > > > On Dec 1, 2011, at

Re: Configuring the Distributed

2011-12-01 Thread Ted Dunning
Well, this goes both ways. It is not that unusual to take a node down for maintenance of some kind or even to have a node failure. In that case, it is very nice to have the load from the lost node be spread fairly evenly across the remaining cluster. Regarding the cost of having several micro-sh

Re: Configuring the Distributed

2011-12-01 Thread Ted Dunning
Of course, resharding is almost never necessary if you use micro-shards. Micro-shards are shards small enough that you can fit 20 or more on a node. If you have that many on each node, then adding a new node consists of moving some shards to the new machine rather than moving lots of little docum
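
A rough sketch of why micro-shards make adding a node cheap: whole shards move and no documents are re-indexed. The greedy rebalancing below and the node and shard names are assumptions for illustration, not how Solr itself assigns shards.

```python
def add_node(assignment, new_node):
    """Return (new_assignment, moves) after moving whole shards onto `new_node`.

    assignment: dict mapping node name -> list of shard ids.
    moves: list of (shard_id, donor_node) pairs, in the order they were moved.
    """
    assignment = {node: list(shards) for node, shards in assignment.items()}
    assignment[new_node] = []
    total = sum(len(shards) for shards in assignment.values())
    target = total // len(assignment)  # shards the new node should end up with
    moves = []
    while len(assignment[new_node]) < target:
        # Always take from whichever existing node currently holds the most shards.
        donor = max(
            (node for node in assignment if node != new_node),
            key=lambda node: len(assignment[node]),
        )
        shard = assignment[donor].pop()
        assignment[new_node].append(shard)
        moves.append((shard, donor))
    return assignment, moves
```

With four nodes holding 20 micro-shards each, adding a fifth node moves 16 shards, four from each existing node, and every node ends up with 16.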

Re: remove answers with identical scores

2011-11-25 Thread Ted Dunning
o it for me! that I have to ask this > question is probably not a good sign, but what is LSH clustering? > > On Fri, Nov 25, 2011 at 4:34 AM, Ted Dunning > wrote: > > > You can do that pretty easily by just retrieving extra documents and post > > processing the results

Re: remove answers with identical scores

2011-11-25 Thread Ted Dunning
You can do that pretty easily by just retrieving extra documents and post-processing the results list. You are likely to have a significant number of apparent duplicates this way. To really get rid of duplicates in results, it might be better to remove them from the corpus by deploying something
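
A minimal sketch of the post-processing approach, assuming the client has over-fetched a list of `(doc_id, score)` pairs already sorted by descending score. The tolerance value is an arbitrary assumption.

```python
def drop_identical_scores(hits, wanted, tolerance=1e-6):
    """Keep the first hit of each run of (nearly) identical scores, up to `wanted` hits.

    hits: list of (doc_id, score) sorted by descending score, fetched with some
    headroom (for example 3 * wanted) so that enough survive the filtering.
    """
    kept = []
    for doc_id, score in hits:
        if kept and abs(kept[-1][1] - score) <= tolerance:
            continue  # same score as the previous kept hit: treat as a duplicate
        kept.append((doc_id, score))
        if len(kept) == wanted:
            break
    return kept
```

Identical scores are only a hint that two documents are duplicates, which is why cleaning the corpus itself tends to be the more reliable fix.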

Re: how to achieve google.com like results for phrase queries

2011-11-05 Thread Ted Dunning
Google achieves their results by using data not found in the web pages themselves. This additional data critically includes link text, but also is derived from behavioral information. On Sat, Nov 5, 2011 at 5:07 PM, wrote: > Hi Erick, > > The term "newspaper latimes" is not found in latimes.

Re: Query time help

2011-10-30 Thread Ted Dunning
That sounds like Nagle's algorithm. http://en.wikipedia.org/wiki/Nagle's_algorithm#Interactions_with_real-time_systems On Sun, Oct 30, 2011 at 2:01 PM, wrote: > Another interesting note. When I use the Solr Admin screen to perform the > same query, it doesn't take as long. Only when using SolrJ
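
For reference, Nagle's algorithm is normally switched off per socket with the `TCP_NODELAY` option. The sketch below shows this with Python's standard `socket` module; it is not SolrJ code, and whether or how a given SolrJ or HTTP client version exposes the option is something to check in that client's documentation.

```python
import socket

def open_socket_without_nagle():
    """Create a TCP socket with Nagle's algorithm disabled (TCP_NODELAY set)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # With TCP_NODELAY set, small writes are sent immediately instead of being
    # held back while waiting for outstanding data to be acknowledged.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock
```

If the delay disappears with the option set, that is decent evidence that the small-write and delayed-ACK interaction was the culprit.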

Re: Search calendar avaliability

2011-10-27 Thread Ted Dunning
On Thu, Oct 27, 2011 at 7:13 AM, Anatoli Matuskova < anatoli.matusk...@gmail.com> wrote: > I don't like the idea of indexing a doc per each value, the dataset can > grow > a lot. What does a lot mean? How high is the sky? A million people with 3 year schedules is a billion tiny documents. Tha
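
The arithmetic behind "a billion tiny documents", assuming one document per person per day (that granularity is an assumption, not something stated in the thread):

```python
def schedule_docs(people, years, docs_per_person_per_day=1, days_per_year=365):
    """Number of documents needed to index one availability entry per person per day."""
    return people * years * days_per_year * docs_per_person_per_day

if __name__ == "__main__":
    print(f"{schedule_docs(1_000_000, 3):,}")  # 1,095,000,000
```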

Re: More new solr cloud questions

2011-10-13 Thread Ted Dunning
On Thu, Oct 13, 2011 at 1:37 PM, wrote: > > Hi, > I have some questions about the 4.0 solr cloud implementation. > > 1. I want to have a large cloud of machines on a network. each machine > will process data and write to its "local" solr server (node,shard or > whatever). This is necessary becau

Re: Replication with an HA master

2011-10-11 Thread Ted Dunning
On Tue, Oct 11, 2011 at 8:17 PM, Otis Gospodnetic < otis_gospodne...@yahoo.com> wrote: > > In the case of using a shared (SAN) index between 2 masters, what happens > if the > > live master fails in such a way that the index remains "locked" (such > > as if some hardware failure and it did not unl

Re: Replication with an HA master

2011-10-11 Thread Ted Dunning
On Tue, Oct 11, 2011 at 6:55 PM, Brandon Ramirez < brandon_rami...@elementk.com> wrote: > Using a shared volume crossed my mind too, but I discarded the idea because > of literature I have read about Lucene performing poorly against remote file > systems. But then I suppose a SAN wouldn't be a re