Use multiple fields and you get what you want. The extra fields will
cost very little and will have a big positive impact.
On Wed, Feb 15, 2012 at 9:30 AM, Jamie Johnson wrote:
> I think it would if I indexed the time information separately. Which
> was my original thought, but I was h
In general this kind of function is very easy to construct using sums of basic
sigmoidal functions. The logistic and probit functions are commonly used for
this.
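A minimal sketch of the idea in Python (the step positions and weights below are illustrative, not from any particular application): summing shifted logistic functions gives a smooth, monotone function with rises wherever you place the steps.

```python
import math

def logistic(x, center, scale):
    """Standard logistic function, shifted to `center` with steepness 1/scale."""
    return 1.0 / (1.0 + math.exp(-(x - center) / scale))

def ramp(x, steps):
    """Sum of weighted logistic steps: a smooth 'staircase' function.
    `steps` is a list of (center, scale, weight) tuples."""
    return sum(w * logistic(x, c, s) for c, s, w in steps)

# Example: a function that rises near x=0 and again near x=10,
# approaching 0 far to the left and 1 far to the right.
steps = [(0.0, 1.0, 0.5), (10.0, 1.0, 0.5)]
```

A probit (Gaussian CDF) could be substituted for `logistic` with no other changes.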
On Feb 14, 2012, at 10:05, Mark wrote:
> Thanks I'll have a look at this. I should have mentioned that the act
Add this as well:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.155.5030
On Wed, Feb 8, 2012 at 1:56 AM, Andrzej Bialecki wrote:
> On 08/02/2012 09:17, Ted Dunning wrote:
>
>> This is true with Lucene as it stands. It would be much faster if there
>> were a spe
This is true with Lucene as it stands. It would be much faster if there
were a specialized in-memory index such as is typically used with high
performance search engines.
On Tue, Feb 7, 2012 at 9:50 PM, Lance Norskog wrote:
> Experience has shown that it is much faster to run Solr with a small
omments
> stand. The main point is that mixing all the *results* of the
> analysis chains for multiple languages into a single field
> will likely result in "interesting" behavior. Not to say it won't
> be satisfactory in your situation, but there are edge cases.
>
>
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
> Solr Training - www.solrtraining.com
>
> On 20. jan. 2012, at 18:15, Ted Dunning wrote:
>
> > I think you misunderstood what I am suggesting.
> >
> > I am suggesting an analyzer that detects the languag
I think you misunderstood what I am suggesting.
I am suggesting an analyzer that detects the language and then "does the
right thing" according to the language it finds. As such, it would
tokenize and stem English according to English rules, German by German
rules and would probably do a sliding
> http://sematext.com/spm/solr-performance-monitoring/index.html
>
>
> - Original Message -
> > From: nibing
> > To: solr-user@lucene.apache.org
> > Cc:
> > Sent: Friday, January 20, 2012 1:51 AM
> > Subject: RE: Tika0.10 language identifier in Solr3.5
Write a tokenizer that does language ID and then picks which tokenizer to
use. Then record the language in the language id field.
What is there to elaborate?
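The dispatch logic can be sketched in a few lines of Python. The stopword-overlap detector below is a toy stand-in (a real implementation would use a proper language identifier such as Tika's), and the per-language analysis is reduced to lowercasing and splitting for illustration:

```python
# Toy language-ID dispatch: detect the language, tokenize accordingly,
# and return the language so it can be stored in a language-id field.

STOPWORDS = {
    "en": {"the", "and", "of", "to"},
    "de": {"der", "die", "und", "das"},
}

def detect_language(text):
    words = set(text.lower().split())
    # Pick the language whose stopwords overlap the text the most.
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))

def tokenize(text):
    lang = detect_language(text)
    # Stand-in for dispatching to per-language analysis chains.
    tokens = text.lower().split()
    return lang, tokens

lang, tokens = tokenize("the cat and the hat")
```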
On Fri, Jan 20, 2012 at 1:58 AM, nibing wrote:
> But then there occurs a problem of using analyzer in indexing. I assume
> files encoded
ent is that adding 1 SSD to each server and using it as a
> cache (more precisely as cache expansion to the cache already in RAM) will
> give you the best price/performance benefit of all options you have.
>
>
>
> Does this clarify things? Was I able to answer your question?
>
Actually, for search applications there is a reasonable amount of evidence
that holding the index in RAM is more cost effective than SSDs
because the throughput is sufficiently faster to make up for the price
differential. There are several papers out of UMass that describe this
trade-off, alt
Normally this is done by putting a field on each document rather than
separating the documents into separate corpora. Keeping them together
makes the final search faster.
At query time, you can add all of the language keys that you think are
relevant based on your language id applied to the query
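A rough sketch of that query construction in Python (the `text_en` / `text_de` field names are illustrative, not prescribed by anything above):

```python
# Given candidate languages from query-time language id, search the
# per-language fields of the single shared index with an OR over them.

def build_query(terms, languages):
    clauses = [f"text_{lang}:({terms})" for lang in languages]
    return " OR ".join(clauses)

q = build_query("zug train", ["en", "de"])
```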
Take a look at openTSDB.
You might want to use that as is, or steal some of the concepts. The major
idea to snitch is using a single row of the database (a document in Lucene
or Solr) to hold many data points.
Thus, you could consider having documents with the following fields:
key:
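One hypothetical layout along those lines in Python (the bucket size, key scheme, and field names here are all assumptions for illustration): each document covers one metric for one coarse time bucket and stores many points as parallel offset/value lists.

```python
# Pack many data points into one document per (metric, time bucket),
# OpenTSDB-style, instead of one tiny document per point.

def make_documents(metric, points, bucket_seconds=3600):
    """points: list of (epoch_seconds, value). Returns one doc per bucket."""
    docs = {}
    for ts, value in points:
        bucket = ts - ts % bucket_seconds
        key = f"{metric}:{bucket}"
        doc = docs.setdefault(key, {"key": key, "offsets": [], "values": []})
        doc["offsets"].append(ts - bucket)   # seconds into the bucket
        doc["values"].append(value)
    return list(docs.values())

docs = make_documents("cpu.load", [(7200, 0.5), (7260, 0.6), (10900, 0.4)])
```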
Peter,
My guess is that if you had said something along the lines of "We have
developed some SSD support software that makes SOLR work better. I would
like to open a conversation here (link to external discussion)" that would
have been reasonably well received. One of the things that makes SPAM
On Thu, Jan 19, 2012 at 1:40 AM, Darren Govoni wrote:
> And to be honest, many people on this list are professionals who not only
> build their own solutions, but also buy tools and tech.
>
> I don't see what the big deal is if some clever company has something of
> imminent value here to share i
I think the OP meant to use random order in the case of score ties.
On Wed, Jan 11, 2012 at 9:31 PM, Erick Erickson wrote:
> Alexandre:
>
> Have you thought about grouping? If you can analyze the incoming
> documents and include a field such that "similar" documents map
> to the same value, then
less than trying to engineer all possible
rewrites by hand.
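Generating the number-to-words rewrites programmatically is straightforward; a sketch covering 0-999 (larger numbers follow the same recursive pattern):

```python
# Generate number <-> words synonym entries instead of hand-writing
# every line of the synonyms dictionary.

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n):
    if n < 20:
        return ONES[n]
    if n < 100:
        word = TENS[n // 10]
        return word + (" " + ONES[n % 10] if n % 10 else "")
    word = ONES[n // 100] + " hundred"
    return word + (" " + number_to_words(n % 100) if n % 100 else "")

def synonym_lines(limit):
    # One line per number, in the usual "a => b" synonyms-file format.
    return [f"{n} => {number_to_words(n)}" for n in range(limit)]
```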
On Tue, Jan 10, 2012 at 10:21 PM, Tanner Postert
wrote:
> You mention "that is one way to do it". Is there another I'm not seeing?
>
> On Jan 10, 2012, at 4:34 PM, Ted Dunning wrote:
>
> > On Tue, J
On Tue, Jan 10, 2012 at 5:32 PM, Tanner Postert wrote:
> We've had some issues with people searching for a document with the
> search term '200 movies'. The document is actually title 'two hundred
> movies'.
>
> Do we need to add every number to our synonyms dictionary to
> accomplish this?
Tha
On Sun, Jan 8, 2012 at 3:33 PM, Michael Lissner <
mliss...@michaeljaylissner.com> wrote:
> I have a unique use case where I have words in my corpus that users
> shouldn't ever be allowed to search for. My theory is that if I add these
> to the stopwords list, that should do the trick.
>
That shou
Option 3 is preferable because you can use phrase queries to get
interesting results as in "color light beige" or "color light".
Normalizing is bad in this kind of environment.
On Sun, Jan 8, 2012 at 11:35 AM, jimmy wrote:
> ...
> First Table KEYWORDS:
> keyword_id, keyword
> 1, white horse
> 2
This copying is a bit overstated here because of the way that small
segments are merged into larger segments. Those larger segments are then
copied much less often than the smaller ones.
While you can wind up with lots of copying in certain extreme cases, it is
quite rare. In particular, if you
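A toy model makes the amortization concrete (merge factor and accounting below are simplified assumptions, not Lucene's actual merge policy): with a merge factor of 10, each document is rewritten roughly once per tier it passes through, so total copying grows like log10(N) per document rather than linearly.

```python
# Count document copies performed while building an index of num_docs
# documents one at a time under simple tiered merging.

def total_copies(num_docs, merge_factor=10):
    copies = 0
    tiers = []  # tiers[i] = number of segments of size merge_factor**i
    for _ in range(num_docs):
        tiers = tiers or [0]
        tiers[0] += 1
        level = 0
        while tiers[level] == merge_factor:
            # Merging rewrites every doc in the merged segments once.
            copies += merge_factor ** (level + 1)
            tiers[level] = 0
            if level + 1 == len(tiers):
                tiers.append(0)
            tiers[level + 1] += 1
            level += 1
    return copies

# total_copies(10_000) == 40_000, i.e. 4 copies per document (log10 of N).
```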
On Thu, Dec 22, 2011 at 7:02 AM, Zoran | Bax-shop.nl <
zoran.bi...@bax-shop.nl> wrote:
> Hello,
>
> What are (ballpark figures) the hardware requirements (disk space, memory)
> SOLR will need in this case:
>
>
> * Heavy Dutch traffic webshop, 30.000 - 50.000 visitors a day
>
Unique users doesn
illions of users,
> and a rough estimation for each user's data would be something around
> 5 MB.
>
> The other problem is that those data will be changed very often.
>
> I hope I answered your question.
>
> Thanks
>
> On Tue, Dec 20, 2011 at 4:00 PM, Ted Dunning
You didn't mention how big your data is or how you create it.
Hadoop would mostly be used in the preparation of the data or the off-line
creation of indexes.
On Tue, Dec 20, 2011 at 12:28 PM, Alireza Salimi
wrote:
> Hi,
>
> I have a basic question, let's say we're going to have a very very huge set
We still disagree.
On Fri, Dec 16, 2011 at 12:29 PM, Jason Rutherglen <
jason.rutherg...@gmail.com> wrote:
> Ted,
>
> The list would be unreadable if everyone spammed at the bottom their
> email like Otis'. It's just bad form.
>
> Jason
>
> On Fri, Dec 16
Sounds like we disagree.
On Fri, Dec 16, 2011 at 11:56 AM, Jason Rutherglen <
jason.rutherg...@gmail.com> wrote:
> Ted,
>
> "...- FREE!" is stupid idiot spam. It's annoying and not suitable.
>
> On Fri, Dec 16, 2011 at 11:45 AM, Ted Dunning
> wrote:
>
I thought it was slightly clumsy, but it was informative. It seemed like a
fine thing to say. Effectively it was "I/we have developed a tool that
will help you solve your problem". That is responsive to the OP and it is
clear that it is a commercial deal.
On Fri, Dec 16, 2011 at 10:02 AM, Jason
Here is a talk I did on this topic at HPTS a few years ago.
On Thu, Dec 15, 2011 at 4:28 PM, Robert Petersen wrote:
> I see there is a lot of discussions about "micro-sharding", I'll have to
> read them. I'm on an older version of solr and just use master index
> replicating out to a farm of sl
On Mon, Dec 5, 2011 at 3:28 PM, Shawn Heisey wrote:
> On 12/4/2011 12:41 AM, Ted Dunning wrote:
>
>> Read the papers I referred to. They describe how to search a fairly
>> enormous corpus with an 8GB in-memory index (and no disk cache at all).
>>
>>
>
> The
SAX is attractive, but I have found it lacking in practice. My primary
issue is that in order to get sufficient recall for practical matching
problems, I had to do enough query expansion that the speed advantage of
inverted indexes went away.
The OP was asking for blob storage, however, and I thi
On Sat, Dec 3, 2011 at 6:36 PM, Shawn Heisey wrote:
> On 12/3/2011 2:25 PM, Ted Dunning wrote:
>
>> Things have changed since I last did this sort of thing seriously. My
>> guess is that this is a relatively small amount of memory to devote to
>> search. It used to be
On Sat, Dec 3, 2011 at 10:54 AM, Shawn Heisey wrote:
> In another thread, something was said that sparked my interest:
>
> On 12/1/2011 7:17 PM, Ted Dunning wrote:
>
>> Of course, resharding is almost never necessary if you use micro-shards.
>> Micro-shards are shards s
> event if you plan carefully, and I think many will be able to handle the
> cost of splitting (you might even mark the replica you are splitting on so
> that it's not part of queries while its 'busy' splitting).
> >
> > - Mark
> >
> > On Dec 1, 2011, at
Well, this goes both ways.
It is not that unusual to take a node down for maintenance of some kind or
even to have a node failure. In that case, it is very nice to have the
load from the lost node be spread fairly evenly across the remaining
cluster.
Regarding the cost of having several micro-sh
Of course, resharding is almost never necessary if you use micro-shards.
Micro-shards are shards small enough that you can fit 20 or more on a
node. If you have that many on each node, then adding a new node consists
of moving some shards to the new machine rather than moving lots of little
docum
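A small Python sketch of that rebalancing (the round-robin policy and node names are illustrative): with ~20 micro-shards per node, adding a node means moving whole shards off the most-loaded nodes, with no re-indexing of documents.

```python
# Rebalance micro-shards onto a newly added node by moving whole shards
# from over-loaded nodes until loads are roughly even.

def rebalance(assignment, new_node):
    """assignment: dict of node -> list of shard ids."""
    assignment = {n: list(s) for n, s in assignment.items()}
    assignment[new_node] = []
    total = sum(len(s) for s in assignment.values())
    target = total // len(assignment)
    moved = []
    for node, shards in assignment.items():
        # Pop shards off any node holding more than its fair share.
        while node != new_node and len(shards) > target:
            moved.append(shards.pop())
    assignment[new_node] = moved
    return assignment, moved

nodes = {"n1": list(range(0, 20)), "n2": list(range(20, 40))}
after, moved = rebalance(nodes, "n3")
```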
o it for me! That I have to ask this
> question is probably not a good sign, but what is LSH clustering?
>
> On Fri, Nov 25, 2011 at 4:34 AM, Ted Dunning
> wrote:
>
> > You can do that pretty easily by just retrieving extra documents and post
> > processing the results
You can do that pretty easily by just retrieving extra documents and post
processing the results list.
You are likely to have a significant number of apparent duplicates this
way.
To really get rid of duplicates in results, it might be better to remove
them from the corpus by deploying something
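The over-fetch-and-filter approach can be sketched as follows (the `sig` field standing in for whatever duplicate-detection signature you compute is an assumption for illustration):

```python
# Retrieve extra results, then walk the ranked list keeping only the
# first hit per duplicate-detection signature.

def dedupe(results, wanted, key=lambda doc: doc["sig"]):
    """results: ranked list of docs; keeps the first doc per signature."""
    seen, out = set(), []
    for doc in results:
        k = key(doc)
        if k not in seen:
            seen.add(k)
            out.append(doc)
        if len(out) == wanted:
            break
    return out

hits = [{"id": 1, "sig": "a"}, {"id": 2, "sig": "a"}, {"id": 3, "sig": "b"}]
top = dedupe(hits, wanted=2)
```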
Google achieves their results by using data not found in the web pages
themselves. This additional data critically includes link text, but also
is derived from behavioral information.
On Sat, Nov 5, 2011 at 5:07 PM, wrote:
> Hi Erick,
>
> The term "newspaper latimes" is not found in latimes.
That sounds like Nagle's algorithm.
http://en.wikipedia.org/wiki/Nagle's_algorithm#Interactions_with_real-time_systems
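If that is the cause, the standard fix on the client side is to disable Nagle's algorithm on the socket so small writes go out immediately instead of being coalesced; in Python that looks like:

```python
import socket

def make_low_latency_socket():
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Disable Nagle's algorithm: send small writes immediately
    # rather than waiting to coalesce them with later data.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock

sock = make_low_latency_socket()
```

In Java (which SolrJ uses) the equivalent is `Socket.setTcpNoDelay(true)` or the corresponding HTTP client connection parameter.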
On Sun, Oct 30, 2011 at 2:01 PM, wrote:
> Another interesting note. When I use the Solr Admin screen to perform the
> same query, it doesn't take as long. Only when using SolrJ
On Thu, Oct 27, 2011 at 7:13 AM, Anatoli Matuskova <
anatoli.matusk...@gmail.com> wrote:
> I don't like the idea of indexing a doc per each value, the dataset can
> grow
> a lot.
What does "a lot" mean? How high is the sky?
A million people with 3 year schedules is a billion tiny documents.
Tha
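The back-of-envelope arithmetic behind that claim (assuming one schedule entry per person per day):

```python
# A million people, one tiny document per person per day, for 3 years.
people = 1_000_000
days = 3 * 365
total_docs = people * days  # about 1.1 billion tiny documents
```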
On Thu, Oct 13, 2011 at 1:37 PM, wrote:
>
> Hi,
> I have some questions about the 4.0 solr cloud implementation.
>
> 1. I want to have a large cloud of machines on a network. each machine
> will process data and write to its "local" solr server (node,shard or
> whatever). This is necessary becau
On Tue, Oct 11, 2011 at 8:17 PM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:
> > In the case of using a shared (SAN) index between 2 masters, what happens
> if the
> > live master fails in such a way that the index remains "locked" (such
> > as if some hardware failure and it did not unl
On Tue, Oct 11, 2011 at 6:55 PM, Brandon Ramirez <
brandon_rami...@elementk.com> wrote:
> Using a shared volume crossed my mind too, but I discarded the idea because
> of literature I have read about Lucene performing poorly against remote file
> systems. But then I suppose a SAN wouldn't be a re