Re: Solr feasibility with terabyte-scale data

2008-05-10 Thread Marcus Herou
Thanks Ken.

I will take a look, be sure of that :)

Kindly

//Marcus

On Fri, May 9, 2008 at 10:26 PM, Ken Krugler <[EMAIL PROTECTED]>
wrote:

> Hi Marcus,
>
>  It seems a lot of what you're describing is really similar to MapReduce,
>> so I think Otis' suggestion to look at Hadoop is a good one: it might
>> prevent a lot of headaches and they've already solved a lot of the tricky
>> problems. There are a number of ridiculously sized projects using it to solve
>> their scale problems, not least Yahoo...
>>
>
> You should also look at a new project called Katta:
>
> http://katta.wiki.sourceforge.net/
>
> First code check-in should be happening this weekend, so I'd wait until
> Monday to take a look :)
>
> -- Ken
>
>
>  On 9 May 2008, at 01:17, Marcus Herou wrote:
>>
>>  Cool.
>>>
>>> Since you must certainly already have a good partitioning scheme, could
>>> you elaborate at a high level on how you set this up?
>>>
>>> I'm certain that I will shoot myself in the foot once or twice before
>>> getting it right, but that is what I'm good at: I never stop trying :)
>>> However, it is nice to start out playing on the right side of the
>>> football field, so a little push in the right direction would be really helpful.
>>>
>>> Kindly
>>>
>>> //Marcus
>>>
>>>
>>>
>>> On Fri, May 9, 2008 at 9:36 AM, James Brady <[EMAIL PROTECTED]> wrote:
>>>
 Hi, we have an index of ~300GB, which is at least approaching the ballpark
 you're in.

 Lucky for us, to coin a phrase, we have an 'embarrassingly partitionable'
 index, so we can just scale out horizontally across commodity hardware with
 no problems at all. We're also using the multicore features available in the
 development Solr version to reduce the granularity of core size by an order
 of magnitude: this makes for lots of small commits, rather than a few long
 ones.

 There was mention somewhere in the thread of document collections: if
 you're going to be filtering by collection, I'd strongly recommend
 partitioning too. It makes scaling so much less painful!

 James


 On 8 May 2008, at 23:37, marcusherou wrote:

  Hi.
>
> I will be heading down a path like yours as well within some months from now.
> Currently I have an index of ~10M docs and only store ids in the index for
> performance and distribution reasons. When we enter a new market I'm
> assuming we will soon hit 100M and quite soon after that 1G documents. Each
> document has on average about 3-5 KB of data.
>
> We will use a GlusterFS installation with RAID1 (or RAID10) SATA enclosures
> as shared storage (think of it as a SAN, or shared storage at least, with one
> mount point). Hope this will be the right choice; only the future can tell.
>
> Since we are developing a search engine, I frankly don't think even having
> hundreds of SOLR instances serving the index will cut it performance-wise if
> we have one big index. I totally agree with the others claiming that you will
> most definitely go OOM or hit some other constraint of SOLR if you must
> have the whole result set in memory, sort it and create an XML response. I
> did hit such constraints when I couldn't afford to give the instances enough
> memory, and I had only 1M docs back then. And think of it... optimizing a TB
> index will take a long, long time, and you really want an optimized index if
> you want to reduce search time.
>
> I am thinking of a sharding solution where I fragment the index over the
> disk(s) and let each SOLR instance hold only a little piece of the total
> index. This will require a master database or namenode (or, simpler, just a
> properties file in each index dir) of some sort to know which docs are
> located on which machine, or at least how many docs each shard has. This is
> to ensure that whenever you introduce a new SOLR instance with a new shard,
> the master indexer will know which shard to prioritize. This is probably not
> enough either, since all new docs will go to the new shard until it is filled
> (has the same size as the others); only then will all shards receive docs in
> a load-balanced fashion. So whenever you want to add a new indexer you
> probably need to initiate a "stealing" process where it steals docs from the
> others until it reaches some sort of threshold (10 servers = each shard
> should have 1/10 of the docs, or thereabouts).
>
> I think this will cut it and enable us to grow with the data. I think doing
> a distributed reindexing will be a good thing as well when it comes to
> cutting both indexing and optimizing time. Probably each indexer should
> buffer its shard locally on RAID1 SCSI disks, optimize it and then just
> copy it to 

Re: Solr feasibility with terabyte-scale data

2008-05-10 Thread Marcus Herou
Hi Otis.

Thanks for the insights. Nice to get feedback from a Technorati guy, and nice
to see that your snippet is almost a copy of mine; it gives me the right gut
feeling about this :)

I'm quite familiar with Hadoop, as you can see if you check out the code of
my open-source project AbstractCache
(http://dev.tailsweep.com/projects/abstractcache/). AbstractCache is a project
which aims to create storage solutions based on the Map and SortedMap
interfaces. I use it everywhere on Tailsweep.com and used it as well at my
former employer Eniro.se (the largest yellow pages site in Sweden). It has
been in constant development for five years.
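
To illustrate the idea of programming storage against the plain Map/SortedMap
interfaces, here is a minimal sketch using java.util types only (this is not
the actual AbstractCache API):

import java.util.SortedMap;
import java.util.concurrent.ConcurrentSkipListMap;

// Sketch: callers only ever see the SortedMap contract, so the backing
// store (in-memory, disk-based, clustered) can be swapped without
// touching calling code.
public class DocStore {
    private final SortedMap<String, byte[]> store =
            new ConcurrentSkipListMap<String, byte[]>();

    public void put(String id, byte[] doc) { store.put(id, doc); }

    public byte[] get(String id) { return store.get(id); }

    // Range scans come for free from SortedMap, e.g. all ids in [from, to).
    public SortedMap<String, byte[]> range(String from, String to) {
        return store.subMap(from, to);
    }
}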

Since I'm a cluster freak of nature, I love a project named GlusterFS, where
they have managed to create a system without master/slave[s] and a NameNode.
The advantage of this is that it is a lot more scalable; the drawback is that
you can get into split-brain situations, which people on the mailing list are
complaining about. Anyway, I tend to try to solve this with JGroups
membership, where the coordinator can be any machine in the cluster, but in
the group-joining process the first machine to join gets the privilege of
becoming coordinator. But even with JGroups you can run into trouble with
race conditions of all kinds (distributed locks, for example).
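
As a rough sketch of that JGroups approach (the cluster name is made up, and
it relies on JGroups' convention that the first member of the installed View
is the oldest node):

import org.jgroups.JChannel;
import org.jgroups.ReceiverAdapter;
import org.jgroups.View;

public class ShardMembership extends ReceiverAdapter {
    private final JChannel channel;
    private volatile boolean coordinator;

    public ShardMembership() throws Exception {
        channel = new JChannel();          // default protocol stack
        channel.setReceiver(this);
        channel.connect("index-shards");   // hypothetical cluster name
    }

    // JGroups installs a new View on every join/leave; the first member in
    // the view is the oldest node, so it takes the coordinator role.
    public void viewAccepted(View view) {
        coordinator = view.getMembers().get(0).equals(channel.getAddress());
        System.out.println("members=" + view.getMembers()
                + ", coordinator=" + coordinator);
    }

    public boolean isCoordinator() { return coordinator; }
}

This still does not protect against split-brain if the network partitions,
which is exactly the failure mode mentioned above.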

I've created an alternative to the Hadoop file system (mostly for fun) where
you just add an object to the cluster and, based on which algorithm you
choose, it is RAIDed or striped across the cluster.

Anyway, this was off topic, but I think my experience in building
membership-aware clusters will help me in this particular case.
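
For the shard-routing scheme sketched earlier in the thread (a small
"namenode" piece that always sends new documents to the emptiest shard), a
minimal sketch could look like this; the registry and its fields are
hypothetical:

import java.util.ArrayList;
import java.util.List;

// Hypothetical routing piece: tracks per-shard document counts and routes each
// new document to the currently smallest shard, so a freshly added (empty)
// shard is prioritized until it has caught up with the others.
public class ShardRouter {

    public static class Shard {
        final String solrUrl;
        long docCount;
        Shard(String solrUrl, long docCount) {
            this.solrUrl = solrUrl;
            this.docCount = docCount;
        }
    }

    private final List<Shard> shards = new ArrayList<Shard>();

    public synchronized void addShard(String solrUrl, long existingDocs) {
        shards.add(new Shard(solrUrl, existingDocs));
    }

    // Pick the target shard for the next document and bump its counter.
    public synchronized Shard route() {
        Shard smallest = shards.get(0);
        for (Shard s : shards) {
            if (s.docCount < smallest.docCount) {
                smallest = s;
            }
        }
        smallest.docCount++;
        return smallest;
    }
}

Note that this reproduces the behaviour described above: the new shard absorbs
all new documents until it catches up, which is exactly why a background
"stealing" process that moves existing documents would be needed to keep write
traffic balanced from the start.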

Kindly

//Marcus



On Fri, May 9, 2008 at 6:54 PM, Otis Gospodnetic <[EMAIL PROTECTED]>
wrote:

> Marcus,
>
> You are headed in the right direction.
>
> We've built a system like this at Technorati (Lucene, not Solr) and had
> components like the "namenode" or "controller" that you mention.  If you
> look at the Hadoop project, you will see something similar in concept
> (NameNode), though it deals with raw data blocks, their placement in the
> cluster, etc.  As a matter of fact, I am currently running its "re-balancer"
> in order to move some of the blocks around in the cluster.  That matches
> what you are describing for moving documents from one shard to the other.
>  Of course, you can simplify things and just have this central piece be
> aware of any new servers and simply get it to place any new docs on the new
> servers and create a new shard there.  Or you can get fancy and take into
> consideration the hardware resources - the CPU, the disk space, the memory,
> and use that to figure out how much each machine in your cluster can handle
> and maximize its use based on this knowledge. :)
>
> I think Solr and Nutch are in desperate need of this central component
> (which must not be a SPOF!) for shard management.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>
> - Original Message 
> > From: marcusherou <[EMAIL PROTECTED]>
> > To: solr-user@lucene.apache.org
> > Sent: Friday, May 9, 2008 2:37:19 AM
> > Subject: Re: Solr feasibility with terabyte-scale data
> >
> >
> > Hi.
> >
> > I will be heading down a path like yours as well within some months from now.
> > Currently I have an index of ~10M docs and only store ids in the index for
> > performance and distribution reasons. When we enter a new market I'm
> > assuming we will soon hit 100M and quite soon after that 1G documents. Each
> > document has on average about 3-5 KB of data.
> >
> > We will use a GlusterFS installation with RAID1 (or RAID10) SATA enclosures
> > as shared storage (think of it as a SAN, or shared storage at least, with one
> > mount point). Hope this will be the right choice; only the future can tell.
> >
> > Since we are developing a search engine, I frankly don't think even having
> > hundreds of SOLR instances serving the index will cut it performance-wise if
> > we have one big index. I totally agree with the others claiming that you will
> > most definitely go OOM or hit some other constraint of SOLR if you must
> > have the whole result set in memory, sort it and create an XML response. I
> > did hit such constraints when I couldn't afford to give the instances enough
> > memory, and I had only 1M docs back then. And think of it... optimizing a TB
> > index will take a long, long time, and you really want an optimized index
> > if you want to reduce search time.
> >
> > I am thinking of a sharding solution where I fragment the index over the
> > disk(s) and let each SOLR instance hold only a little piece of the total
> > index. This will require a master database or namenode (or, simpler, just a
> > properties file in each index dir) of some sort to know which docs are
> > located on which machine, or at least how many docs each shard has. This is
> > to ensure that whenever you introduce a new SOLR instance with a new shard,
> > the master indexer will know which shard to prioritize. This is probably not
> > eno

Re: token concat filter?

2008-05-10 Thread Chris Hostetter

: I guess not.  I've been reading the wiki, but the trouble with wikis always
: seems to be (for me) finding stuff.  Can you point it out?

just search the wiki for "synonyms" (admittedly, the fact that by default 
it only searches titles and you have to click "text" to search the full 
wiki is annoying) and you'll find this...

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?highlight=%28synonyms%29

...and if you scroll down you'll wind up here...

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#SynonymFilter



-Hoss



Re: Multicore and SolrResourceLoader

2008-05-10 Thread Chris Hostetter

: I've been digging around in multicore and I am curious as to how to force a
: reload of the sharedLib classloader.  I can reload a given core, which
: instantiates a new SolrResourceLoader for that core, but I want to be able to
: reload the classloader for the sharedLib.

that seems really dangerous to me ... you could wind up changing class 
impls out from under a core ... which could give you serious 
incompatibilities.

The only safe way i can imagine doing this would be if we add a way to 
completely reinitialize the MultiCore (which would reload all the 
SolrCores)


-Hoss



Re: exceeded limit of maxWarmingSearchers

2008-05-10 Thread Chris Hostetter

: On a solr instance where I am in the process of indexing a moderately large
: number of documents (300K+). There is no querying of the index taking place
: at all.
: I don't understand what operations are causing new searchers to warm, or how
: to stop them from doing so.  I'd be happy to provide more details of my
: configuration if necessary, I've made very few changes to the solrconfig.xml
: that comes with the sample application.

the one aspect that i didn't see mentioned in this thread so far is cache 
autowarming.

even if no querying is going on while you are doing all the indexing, if 
some querying took place at any point and your caches have some entries in 
them, then every commit will cause autowarming of caches to happen (according 
to the autowarm settings on each cache), which results in queries getting 
executed on your "warming" searcher, and those queries keep cascading on 
to the subsequent warming searchers.

this is one of the reasons why it's generally a good idea to have the 
caches on your "master" boxes all use autowarm counts of "0".  you 
can still use the caches in case you do inadvertently hit your master (you 
don't want it to fall over and die) but you don't want to waste a lot of 
time warming them on every commit until the end of time.

-Hoss



Re: Missing content Stream

2008-05-10 Thread Chris Hostetter

1) Posting the exact same question twice because you didn't get a reply in 
the first 8 hours isn't going to encourage people to reply faster. best 
case scenario: you waste people's time that they could be spending reading 
another email; worst case scenario: you irk people and put them in a bad 
mood so they don't feel like being helpful.

2) In both instances of your question, the curl command you listed didn't 
include a space between the URL and the --data-binary option, perhaps 
that's your problem?

3) what *exactly* do you see in your shell when you run the command?  you 
said the lines from books.csv get dumped to your console, but what appears 
above it? what appears below it? ... books.csv is only 10 lines, just 
paste it all from the line where you run the command to the next prompt 
you see.

4) FYI...

: Changed the following in solrconfig.xml,
: 
: 

a) You don't need enableRemoteStreaming if you plan on posting your CSV 
files; that's only necessary if you want to be able to use the 
stream.file or stream.url options.

b) the /update/csv handler should have already been in the example 
solrconfig.xml exactly as you list it there ... what do you mean you 
changed it?


-Hoss



Re: JSON updates?

2008-05-10 Thread Chris Hostetter

: I was wondering if XML is the only format used for updating Solr documents
: or can JSON or Ruby be used as well?

Solr ships with two "RequestHandlers" for processing updates -- the 
XMLRequestHandler and the CSVRequestHandler.  some others are in Jira 
(RichDocumentUpdateHandler and DataImportRequestHandler) or you could 
implement new ones as plugins.
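
As a very rough sketch of what such a plugin could look like (the class name
is made up and the JSON parsing itself is left out; it just extends Solr's
RequestHandlerBase):

import org.apache.solr.handler.RequestHandlerBase;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.request.SolrQueryResponse;

// Hypothetical skeleton for a handler that would accept JSON (or Ruby) updates;
// it would be registered in solrconfig.xml like any other request handler.
public class JsonUpdateRequestHandler extends RequestHandlerBase {

    public void handleRequestBody(SolrQueryRequest req, SolrQueryResponse rsp)
            throws Exception {
        // Read req.getContentStreams(), parse each stream (parsing omitted here),
        // and turn every record into an add command against the update handler.
        rsp.add("status", "not implemented -- parsing left as an exercise");
    }

    public String getDescription() { return "JSON update handler (sketch)"; }
    public String getSourceId()    { return "$Id$"; }
    public String getSource()      { return "$URL$"; }
    public String getVersion()     { return "1.0"; }
}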


-Hoss



Re: Simple Solr POST using java

2008-05-10 Thread Chris Hostetter

: please post a snippet of Java code to add a document to the Solr index that
: includes the URL reference as a String?

you mean like this one...   :)

http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/util/SimplePostTool.java?view=markup

FWIW: if you want to talk to Solr from a Java app, the SolrJ client API 
is probably worth looking into rather than dealing with the HTTP 
connections and XML formatting directly...

http://wiki.apache.org/solr/Solrj
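
For example, a minimal SolrJ sketch along those lines (field names and the
local URL are just placeholders):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class SimpleSolrjPost {
    public static void main(String[] args) throws Exception {
        // SolrJ takes care of the HTTP connection and XML formatting.
        SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");                          // placeholder fields
        doc.addField("url", "http://example.com/page.html");
        doc.addField("title", "Example page");

        server.add(doc);    // send the document
        server.commit();    // make it searchable
    }
}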


-Hoss



Field Filtering that Varies per Query

2008-05-10 Thread Nathan Woodhull
I have an interesting query filter problem.

The application that I am working on contains a directory of user
profiles that acts a lot like a social networking site. Each user
profile is composed of a set of fields (first_name, last_name, bio,
phone_number, etc). Every user profile field is associated with a
privacy setting that optionally hides the data inside of the field
from other users. The privacy settings allow people to show the field
to nobody, only their contacts on the site, all logged-in users, or
anyone.

This presents a problem while designing a search interface for the
profiles. All of the filtering options I have seen allow for per-document
filtering, but that is not sufficient. Since users have the
option of selectively displaying portions of their profile to
different users, we need to be able to remove individual fields from
specific documents from consideration on a per query basis.

The only idea we have had for resolving this is to construct an
elaborate filter query to restrict the set of documents that the
actual search is performed upon, but it has problems:

We created a series of multivalue fields to store the user profile information

first_name_contact
last_name_contact
etc...

In these fields we stored the privacy preference: anonymous,
logged_in, or the set of contact ids that were allowed access. Then we
created a query filter that was dynamic depending on the identity of
the logged-in user; for a query for the term 'secret' it looked
something like this:

(first_name: secret AND (first_name_contact:anonymous OR
first_name_contact:member)) OR
(last_name: secret AND (last_name_contact:anonymous OR
last_name_contact:member)) OR

This was used as a filter query, with a standard query for the search
term 'secret' performed on the resulting filtered set of documents. This
worked great if the search was a single word. However, if the user's
search query contained multiple terms - for instance, 'my secret' -
results might be inappropriately revealed. This is because matches
might occur for one term in a public field while the other term might
only exist in fields that are private to the user making the query.
Because documents would be allowed into the filtered set of potential
results in that case, they would be matched by the actual query. By
executing a set of queries, a user could infer the contents of a
protected document field even though they would be unable to view its
contents.
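
To make the construction above concrete, here is a rough sketch of how such a
per-viewer filter query might be assembled (field names, viewer-id handling
and term escaping are all simplified):

import java.util.Arrays;
import java.util.List;

// Builds the kind of per-viewer filter query described above: for each profile
// field, the term must match AND that field's privacy companion must allow the
// current viewer (anonymous, any logged-in member, or an explicit contact id).
public class PrivacyFilterBuilder {

    private static final List<String> FIELDS =
            Arrays.asList("first_name", "last_name", "bio", "phone_number");

    public static String build(String term, String viewerId, boolean loggedIn) {
        StringBuilder fq = new StringBuilder();
        for (String field : FIELDS) {
            if (fq.length() > 0) {
                fq.append(" OR ");
            }
            StringBuilder allowed = new StringBuilder(field + "_contact:anonymous");
            if (loggedIn) {
                allowed.append(" OR ").append(field).append("_contact:member");
            }
            if (viewerId != null) {
                allowed.append(" OR ").append(field).append("_contact:").append(viewerId);
            }
            fq.append("(").append(field).append(":").append(term)
              .append(" AND (").append(allowed).append("))");
        }
        return fq.toString();
    }
}

As the paragraph above explains, this only restricts which documents enter the
candidate set, so with multi-term queries a document can still qualify via one
public field while the actual query then matches a term that exists only in a
private field.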

  We have been unable to think of a way to construct a query that
overcomes this issue. Looking briefly at Lucene, there does not seem
to be an obvious way to do the sort of field-based filtering that
varies on a per-query basis that we need, even if we were
willing to dig deeper and write some custom code. Does anyone know of
any tricks that we might use? Is it even possible to do this given how
the low-level architecture of Lucene may or may not work?

Any help would be greatly appreciated.

Thanks,

Nathan Woodhull


Re: Simple Solr POST using java

2008-05-10 Thread climbingrose
Agree. I've been using Solrj on a production site for 9 months without any
problems at all. You should probably give it a try instead of dealing with
all those low-level details.


On Sun, May 11, 2008 at 4:14 AM, Chris Hostetter <[EMAIL PROTECTED]>
wrote:

>
> : please post a snippet of Java code to add a document to the Solr index
> that
> : includes the URL reference as a String?
>
> you mean like this one...   :)
>
>
> http://svn.apache.org/viewvc/lucene/solr/trunk/src/java/org/apache/solr/util/SimplePostTool.java?view=markup
>
> FWIW: if you want to talk to Solr from a Java app, the SolrJ client API
> is probably worth looking into rather than dealing with the HTTP
> connections and XML formatting directly...
>
> http://wiki.apache.org/solr/Solrj
>
>
> -Hoss
>
>


-- 
Regards,

Cuong Hoang