Sharding and Replication

2013-06-19 Thread Asif
Hi,

I had questions on implementation of Sharding and Replication features of
Solr/Cloud.

1. I noticed that when sharding is enabled for a collection - individual
requests are sent to each node serving as a shard.

2. Replication, too, follows the above strategy of sending individual
documents to the nodes serving as replicas.

I am working with a system that requires a massive number of writes - I have
noticed that for the above reason the cloud eventually starts to fail
(even though I am using an ensemble).

I do understand the reason behind individual updates - but why not batch
them up, or give an option to batch N updates, in either of the above cases? I
did come across a presentation that talked about batching 10 updates for
replication at least, but I do not think this is the case.
- Asif



Re: Sharding and Replication

2013-06-21 Thread Asif
Erick,

Thanks for your reply.

You are right about the 10 updates being batched up - it was hard to figure
out due to the large number of updates/logging that happens in our system.

We are batching 1000 updates every time.

Here are my observations from the leader and replica -

1. The leader logs clearly indicate that 1000 updates arrived - [ (1000
adds)],commit=]
2. On the replica - for each batch of 1000 document adds on the leader - I
see a lot of requests, with no indication of how many updates are in each
request.

Digging a little into the Solr code, I found the variable I am
interested in - maxBufferedAddsPerServer is set to 10 -

http://svn.apache.org/viewvc/lucene/dev/trunk/solr/core/src/java/org/apache/solr/update/SolrCmdDistributor.java?view=markup

This means that for a batch update of 1000 documents, we will see 100
requests to each replica - which translates into 100 writes per collection
per second in our system.
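The arithmetic here can be checked with a quick sketch (a toy model of the distributor's buffering, not Solr's actual code):

```python
import math

# Value observed in SolrCmdDistributor at the time.
MAX_BUFFERED_ADDS_PER_SERVER = 10

def replica_requests(batch_size, buffer_size=MAX_BUFFERED_ADDS_PER_SERVER):
    """How many update requests a replica receives for one incoming batch
    when buffered adds are flushed every `buffer_size` documents."""
    return math.ceil(batch_size / buffer_size)

print(replica_requests(1000))  # -> 100
```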

Should this variable be made configurable via solrconfig.xml (or any other
appropriate place)?

A little background about the system we are trying to build - a real-time
analytics solution using SolrCloud + atomic updates. We have a very high
volume of writes - going as high as 1000 updates a second (possibly more in
the long run).

- Asif





On Sat, Jun 22, 2013 at 4:21 AM, Erick Erickson wrote:

> Updates are batched, but it's on a per-request basis. So, if
> you're sending one document at a time you won't get any
> batching. If you send 10 docs at a time and they happen to
> go to 10 different shards, you'll get 10 different update
> requests.
>
> If you're sending 1,000 docs per update you should be seeing
> some batching going on.
>
> bq:  but why not batch them up or give a option to batch N
> updates in either of the above case
>
> I suspect what you're seeing is that you're not sending very
> many docs per update request and so are being misled.
>
> But that's a guess since you haven't provided much in the
> way of data on _how_ you're updating.
>
> bq: the cloud eventually starts to fail
> How? Details matter.
>
> Best
> Erick
>
> On Wed, Jun 19, 2013 at 4:23 AM, Asif  wrote:
> > Hi,
> >
> > I had questions on implementation of Sharding and Replication features of
> > Solr/Cloud.
> >
> > 1. I noticed that when sharding is enabled for a collection - individual
> > requests are sent to each node serving as a shard.
> >
> > 2. Replication too follows above strategy of sending individual documents
> > to the nodes serving as a replica.
> >
> > I am working with a system that requires massive number of writes - I
> have
> > noticed that due to above reason - the cloud eventually starts to fail
> > (Even though I am using a ensemble).
> >
> > I do understand the reason behind individual updates - but why not batch
> > them up or give a option to batch N updates in either of the above case
> - I
> > did come across a presentation that talked about batching 10 updates for
> > replication at least, but I do not think this is the case.
> > - Asif
>


Re: Sharding and Replication

2013-06-22 Thread Asif
Erick,

It's a completely practical problem - we are exploring Solr to build a
real-time analytics/data solution for a system handling about 1000 qps. We
have various metrics that are stored as different collections in the cloud,
which means a very high volume of writes. The cloud also needs to support
about 300-400 qps.

We initially tested with a single Solr node on a 16-core / 24 GB box for a
single metric. We saw that writes were not an issue at all - Solr was
handling them extremely well. We were also able to achieve about 200 qps from
a single node.

When we set up the cloud (an ensemble on 6 boxes), we saw very high CPU
usage on the replicas - up to 10 cores were being used for writes on the
replicas. Hence my concern with respect to batched updates for the replicas.

BTW, I altered maxBufferedAddsPerServer to 1000 - and now CPU usage is
very similar to the single-node installation.

- Asif






On Sat, Jun 22, 2013 at 9:53 PM, Erick Erickson wrote:

> Yeah, there's been talk of making this configurable, but there are
> more pressing priorities so far.
>
> So just to be clear, is this theoretical or practical? I know of several
> very
> high-performance situations where 1,000 updates/sec (and I'm assuming
> that it's 1,000 docs/sec not 1,000 batches of 1,000 docs) hasn't caused
> problems here. So unless you're actually seeing performance problems
> as opposed to fearing that there _might_ be, I'd just go on to the next
> urgent problem.
>
> Best
> Erick
>
> On Fri, Jun 21, 2013 at 8:34 PM, Asif  wrote:
> > Erick,
> >
> > Thanks for your reply.
> >
> > You are right about 10 updates being batch up - It was hard to figure out
> > due to large number of updates/logging that happens in our system.
> >
> > We are batching 1000 updates every time.
> >
> > Here is my observation from leader and replica -
> >
> > 1. Leader logs are clearly indicating that 1000 updates arrived - [ (1000
> > adds)],commit=]
> > 2. On replica - for each 1000 document adds on leader - I see a lot of
> > requests on replica - with no indication of how many updates in each
> > request.
> >
> > Digging a little bit into Solr code  I figured this variable I am
> > interested in - maxBufferedAddsPerServer is set to 10 -
> >
> >
> http://svn.apache.org/viewvc/lucene/dev/trunk/solr/core/src/java/org/apache/solr/update/SolrCmdDistributor.java?view=markup
> >
> > This means for a batch update of 1000 documents - we will be seeing 100
> > requests for replica - which translates into 100 writes per collection
> per
> > second in our system.
> >
> > Should this variable be made configurable via solrconfig.xml (or any
> other
> > appropriate place)?
> >
> > A little background about a system we are trying to build - real time
> > analytics solution using the Solr Cloud + Atomic updates - we have very
> > high amount of writes - going as high as 1000 updates a second (possibly
> > more in long run).
> >
> > - Asif
> >
> >
> >
> >
> >
> > On Sat, Jun 22, 2013 at 4:21 AM, Erick Erickson  >wrote:
> >
> >> Update are batched, but it's on a per-request basis. So, if
> >> you're sending one document at a time you'll won't get any
> >> batching. If you send 10 docs at a time and they happen to
> >> go to 10 different shards, you'll get 10 different update
> >> requests.
> >>
> >> If you're sending 1,000 docs per update you' should be seeing
> >> some batching going on.
> >>
> >> bq:  but why not batch them up or give a option to batch N
> >> updates in either of the above case
> >>
> >> I suspect what you're seeing is that you're not sending very
> >> many docs per update request and so are being mislead.
> >>
> >> But that's a guess since you haven't provided much in the
> >> way of data on _how_ you're updating.
> >>
> >> bq: the cloud eventually starts to fail
> >> How? Details matter.
> >>
> >> Best
> >> Erick
> >>
> >> On Wed, Jun 19, 2013 at 4:23 AM, Asif  wrote:
> >> > Hi,
> >> >
> >> > I had questions on implementation of Sharding and Replication
> features of
> >> > Solr/Cloud.
> >> >
> >> > 1. I noticed that when sharding is enabled for a collection -
> individual
> >> > requests are sent to each node serving as a shard.
> >> >
> >> > 2. Replication too follows above strategy of sending individual
> documents
> >> > to the nodes serving as a replica.
> >> >
> >> > I am working with a system that requires massive number of writes - I
> >> have
> >> > noticed that due to above reason - the cloud eventually starts to fail
> >> > (Even though I am using a ensemble).
> >> >
> >> > I do understand the reason behind individual updates - but why not
> batch
> >> > them up or give a option to batch N updates in either of the above
> case
> >> - I
> >> > did come across a presentation that talked about batching 10 updates
> for
> >> > replication at least, but I do not think this is the case.
> >> > - Asif
> >>
>


indexing documents in Apache Solr using php-curl library

2012-07-02 Thread Asif
I am indexing a file in Solr using the PHP cURL library. I am stuck with this code:
echo "Stored in: " . "upload/" . $_FILES["file"]["name"];
$filename = "upload/" . $_FILES["file"]["name"];
$result = move_uploaded_file($_FILES["file"]["tmp_name"], $filename);
if ($result) echo "Upload done.";

// Read the uploaded XML file once; it becomes the POST body.
$url = "http://localhost:8983/solr/update";
$post_string = file_get_contents($filename);
echo $url;
echo $post_string;

$header = array("Content-type: text/xml; charset=utf-8");

$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_HTTPHEADER, $header);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $post_string);
curl_setopt($ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_1_1);
curl_setopt($ch, CURLINFO_HEADER_OUT, 1);

$data = curl_exec($ch);

if (curl_errno($ch)) {
    print "curl_error: " . curl_error($ch);
} else {
    curl_close($ch);
    print "curl exited okay\n";
    echo "Data returned...\n";
    echo $data;
    echo "\n";
}

Nothing shows up as a result. Moreover, there is nothing in the Apache Solr
log. Please help me with the code.
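One thing worth checking: Solr only makes newly added documents searchable after a commit, and the code above never issues one. For reference, here is a sketch (in Python, just to illustrate the wire format; the field names are made up) of the XML update message Solr's /update handler expects:

```python
import xml.etree.ElementTree as ET

def solr_add_message(docs):
    """Build the XML <add> message for Solr's /update handler from a
    list of field dicts."""
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", {"name": name})
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

# Example payload with made-up field names:
payload = solr_add_message([{"id": "doc1", "title": "Hello"}])
# After POSTing this payload, a separate <commit/> message (or
# ?commit=true on the update URL) is needed before the documents
# become visible to searches.
```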

--
View this message in context: 
http://lucene.472066.n3.nabble.com/indexing-documents-in-Apache-Solr-using-php-curl-library-tp3992452.html
Sent from the Solr - User mailing list archive at Nabble.com.


Hiring solr experts

2010-08-07 Thread Asif Rahman
Hi all,

Does anyone here have any experience hiring solr experts?  Are there any
specific channels that you had good success with?

Thanks,

Asif

-- 
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com


Re: Hiring solr experts

2010-08-07 Thread Asif Rahman
We're actually looking to bring someone on full-time.

On Sat, Aug 7, 2010 at 3:13 PM, Erick Erickson wrote:

> Well, what do you want them to do? Come work full-time for
> your company or consult/contract?
>
> If the latter, have you seen this?
> http://wiki.apache.org/solr/Support
>
> On Sat, Aug 7, 2010 at 4:59 AM, Asif Rahman  wrote:
>
> > Hi all,
> >
> > Does anyone here have any experience hiring solr experts?  Are there any
> > specific channels that you had good success with?
> >
> > Thanks,
> >
> > Asif
> >
> > --
> > Asif Rahman
> > Lead Engineer - NewsCred
> > a...@newscred.com
> > http://platform.newscred.com
> >
>



-- 
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com


Facet filtering

2009-08-20 Thread Asif Rahman
Is there any way to assign metadata to terms in a field and then filter on
that metadata when using that field as a facet?

For example, I have a collection of news articles in my index.  Each article
has a field that contains tags based on the topics discussed in the
article.  An article might have the tags "Barack Obama" and "Chicago".  I
want to assign metadata describing what type of entity each tag is.  For
these tags the metadata would be "person" for "Barack Obama" and "place" for
"Chicago".  Then I want to issue a facet query that returns only "person"
facets.

I can think of two possible solutions for this, both with shortcomings.

1) I could create a field for each metadata category.  So the schema would
have the fields "tag_person" and "tag_place".  The problem with this method
is that I am limited to filtering by a single criterion for each of my
queries.

2) I could leave the Solr schema unmodified and post-process the query.
This solution is less elegant than one that could be completely contained
within Solr.  I also imagine that it would be less performant.
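On option 1, one possibly mitigating detail: Solr allows `facet.field` to be repeated, so a single query can request facets over several metadata fields at once. A sketch of the query parameters, assuming the `tag_person`/`tag_place` fields from the example above:

```python
from urllib.parse import urlencode

# One request faceting on both metadata-category fields at once.
params = [
    ("q", "*:*"),
    ("facet", "true"),
    ("facet.field", "tag_person"),  # person-type tags
    ("facet.field", "tag_place"),   # place-type tags
]
query_string = urlencode(params)
print(query_string)
```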

Any thoughts?

Thanks in advance,

Asif

-- 
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com


Re: Facets with an IDF concept

2009-10-09 Thread Asif Rahman
Hi Wojtek:

Sorry for the late, late reply.  I haven't implemented this yet, but it is
on the (long) list of my todos.  Have you made any progress?

Asif

On Thu, Aug 13, 2009 at 5:42 PM, wojtekpia  wrote:

>
> Hi Asif,
>
> Did you end up implementing this as a custom sort order for facets? I'm
> facing a similar problem, but not related to time. Given 2 terms:
> A: appears twice in half the search results
> B: appears once in every search result
> I think term A is more "interesting". Using facets sorted by frequency,
> term
> B is more important (since it shows up first). To me, terms that appear in
> all documents aren't really that interesting. I'm thinking of using a
> combination of document count (in the result set, not globally) and term
> frequency (in the result set, not globally) to come up with a facet sort
> order.
>
> Wojtek
> --
> View this message in context:
> http://www.nabble.com/Facets-with-an-IDF-concept-tp24071160p24959192.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


-- 
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com


Incorrect sort with function query in query parameters

2009-03-26 Thread Asif Rahman

Hi all,

I'm having an issue with the order of my results when attempting to sort by
a function in my query.  Looking at the debug output of the query, the score
returned in the result section for any given document does not match
the score in the debug output.  It turns out that if I optimize the index,
then the results are sorted correctly.  The scores in the debug output are
the correct scores.  This behavior only occurs with a recent nightly build
of Solr; it works correctly in Solr 1.3.

An example query is:

http://localhost:8080/solr/core-01/select?qt=standard&fl=*,score&rows=10&q=*:*%20_val_:"recip(rord(article_published_at),1,1000,1000)"^1&debugQuery=on

I've attached the result to this email.

Can anybody shed any light on this problem? 

Thanks,

Asif
http://www.nabble.com/file/p22735009/result.xml result.xml 



Re: Incorrect sort with function query in query parameters

2009-03-27 Thread Asif Rahman

Hi Otis,

Any documents marked deleted in this index are just the result of updates to
those documents; there are no purely deleted documents.  Furthermore, the
field that I am ordering by in my function query remains untouched across
the updates.

I've read in other posts that the logic used by the debug component to
calculate the score is different from what the query component uses.  The
score shown in the debug output is correct.  It seems like the two
components are getting two different values for the rord function.

I'm particularly concerned by the fact that this only happens in the nightly
build.  Any ideas on how to correct this?  Unfortunately, it's not feasible
for me to only perform searches on optimized indices because we are doing
constant updates.

Thanks,

Asif


Otis Gospodnetic wrote:
> 
> 
> Asif,
> 
> Could it have something to do with the deleted documents in your
> unoptimized index?  These documents are only marked as deleted.  When you
> run optimize you really remove them completely.  It could be that they are
> getting counted by something and that messes up the scoring/order.
> 
> 




Re: Incorrect sort with function query in query parameters

2009-05-18 Thread Asif Rahman
I have been intending to, although I have been dragging my feet on it.  I've
never opened a bug before, so I'm not sure of the protocol.  If you don't
mind, it would be great if you could send me a PM and point me in the right
direction.

Thanks,

Asif


On Mon, May 18, 2009 at 7:30 PM, Ensdorf Ken  wrote:

> >
> > A Unit test would be ideal, but even if you can just provide a list of
> > steps (ie: using this solrconfig+schema, index these docs, then update
> > this one doc, then execute this search) it can help people track things
> > down.
> >
> > Please open a bug and attach as much detail as you can there.
> >
> >
> > -Hoss
>
> Was a bug ever opened on this?  I am seeing similar behavior (though in my
> case it's the debug scores that look wrong).
>
> -Ken
>
>


Facets with an IDF concept

2009-06-17 Thread Asif Rahman
Hi all,

We have an index of news articles that are tagged with news topics.
Currently, we use Solr facets to see which topics are popular for a given
query or time period.  I'd like to apply the concept of IDF to the facet
counts so as to penalize topics that occur broadly throughout our index.
I've begun to write a custom facet component that applies IDF to the facet
counts, but I also wanted to check whether anyone has experience using facets
in this way.

Thanks,

Asif


Re: Facets with an IDF concept

2009-06-23 Thread Asif Rahman
Hi again,

I guess nobody has used facets in the way I described below before.  Do any
of the experts have any ideas as to how to do this efficiently and
correctly?  Any thoughts would be greatly appreciated.

Thanks,

Asif

On Wed, Jun 17, 2009 at 12:42 PM, Asif Rahman  wrote:

> Hi all,
>
> We have an index of news articles that are tagged with news topics.
> Currently, we use solr facets to see which topics are popular for a given
> query or time period.  I'd like to apply the concept of IDF to the facet
> counts so as to penalize the topics that occur broadly through our index.
> I've begun to write custom facet component that applies the IDF to the facet
> counts, but I also wanted to check if anyone has experience using facets in
> this way.
>
> Thanks,
>
> Asif
>



-- 
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com


Re: Facets with an IDF concept

2009-06-23 Thread Asif Rahman
Hi Kent,

Your problem is a close cousin of the problem that we're tackling.  We have
experienced the same problem as you when calculating facets on MoreLikeThis
queries, since those queries tend to match a lot of documents.  We used one
of the solutions that you mentioned, a rank cutoff, to solve it: we first run
the MoreLikeThis query, then use the top N documents' unique ids as a filter
query for a second query.  The performance is still acceptable; however, our
index is smaller than yours by an order of magnitude.
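The two-pass approach described above can be sketched as follows (the `id` field name and helper are illustrative assumptions, not NewsCred's actual code):

```python
def topn_filter_query(doc_ids, field="id"):
    """Build a Solr fq restricting a follow-up (faceting) query to the
    top-N hits of a previous query, e.g. a MoreLikeThis query."""
    return "{0}:({1})".format(field, " OR ".join(doc_ids))

# Pass 1 would fetch the top-N ids from the MLT query; pass 2 facets
# only over those documents via this filter query.
fq = topn_filter_query(["a1", "b2", "c3"])
print(fq)  # -> id:(a1 OR b2 OR c3)
```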

Regards,

Asif

On Tue, Jun 23, 2009 at 10:34 AM, Kent Fitch  wrote:

> Hi Asif,
>
> I was holding back because we have a similar problem, but we're not
> sure how best to approach it, or even whether approaching it at all is
> the right thing to do.
>
> Background:
> - large index (~35m documents)
> - about 120k on these include full text book contents plus metadata,
> the rest are just metadata
> - we plan to increase number of full text books to around 1m, number
> of records will greatly increase
>
> We've found that because of the sheer volume of content in full text,
> we get lots of results in full text of very low relevance. The Lucene
> relevance ranking works wonderfully to "hide" these way down the list,
> and when these are the only results at all, the user may be delighted
> to find obscure hits.
>
> But when you search for, say : soldier of fortune : one of the 55k+
> results is Huck Finn, with 4 "soldier(s)" and 6 "fortunes", but it
> probably isn't relevant.  The searcher will find it in the result
> sets, but should the author, subject, dates, formats etc (our facets)
> of Huck Finn be contributing to the facets shown to the user as
> equally as, say, the top 500 results?  Maybe, but perhaps they are
> "diluting" the value of facets contributed by the more relevant
> results.
>
> So, we are considering restricting the contents of the result bit set
> used for faceting to exclude results with a very very low score (with
> our own QueryComponent).  But there are problems:
>
> - what's a low score?  How will a low score threshold vary across
> queries? (Or should we use a rank cutoff instead, which is much more
> expensive to compute, or some combo that works with results that only
> have very low relevance results?)
>
> - should we do this for all facets, or just some (where the less
> relevant results seem particularly annoying, as they can "mask" facets
> from the most relevant results - the authors, years and subjects we
> have full text for are not representative of the whole corpus)
>
> - if a searcher pages through to the 1000th result page, down to these
> less relevant results, should we somehow include these results in the
> facets we show?
>
> sorry, only more questions!
>
> Regards,
>
> Kent Fitch
>
> On Tue, Jun 23, 2009 at 5:58 PM, Asif Rahman wrote:
> > Hi again,
> >
> > I guess nobody has used facets in the way I described below before.  Do
> any
> > of the experts have any ideas as to how to do this efficiently and
> > correctly?  Any thoughts would be greatly appreciated.
> >
> > Thanks,
> >
> > Asif
> >
> > On Wed, Jun 17, 2009 at 12:42 PM, Asif Rahman  wrote:
> >
> >> Hi all,
> >>
> >> We have an index of news articles that are tagged with news topics.
> >> Currently, we use solr facets to see which topics are popular for a
> given
> >> query or time period.  I'd like to apply the concept of IDF to the facet
> >> counts so as to penalize the topics that occur broadly through our
> index.
> >> I've begun to write custom facet component that applies the IDF to the
> facet
> >> counts, but I also wanted to check if anyone has experience using facets
> in
> >> this way.
> >>
> >> Thanks,
> >>
> >> Asif
> >>
> >
> >
> >
> > --
> > Asif Rahman
> > Lead Engineer - NewsCred
> > a...@newscred.com
> > http://platform.newscred.com
> >
>



-- 
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com


Re: Facets with an IDF concept

2009-06-23 Thread Asif Rahman
Hi Grant,

I'll give a real-life example of the problem that we are trying to solve.

We index a large number of current news articles on a continuing basis.  We
tag these articles with news topics (e.g. Barack Obama, Iran, etc.) and then
use these tags to facet our queries.  For example, we might issue a query for
all articles in the last 24 hours; the facets then tell us which news topics
have been written about the most in that period.  The problem is that "Barack
Obama", for example, is always written about at high frequency, whereas
"Iran" is currently very hot in the news but has not always been.  In this
case, we'd like to see "Iran" show up higher than "Barack Obama" in the facet
results.

To me, this seems identical to the tf-idf scoring expression that is used in
normal search: the facet count is analogous to the tf, and I can access the
facet terms' IDFs through the Similarity API.
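The weighting this implies can be sketched directly: scale each facet count by log(N/df) for its term and rerank (all numbers below are made up; in practice the document frequencies would come from the index's term statistics):

```python
import math

def idf_weighted_facets(facet_counts, doc_freq, num_docs):
    """Rerank facet terms by count * idf, penalizing terms that are
    common across the whole index."""
    weight = {
        term: count * math.log(num_docs / doc_freq[term])
        for term, count in facet_counts.items()
    }
    return sorted(weight, key=weight.get, reverse=True)

counts = {"Barack Obama": 120, "Iran": 90}   # facet counts (tf analog)
df = {"Barack Obama": 50000, "Iran": 5000}   # global document frequency
print(idf_weighted_facets(counts, df, 1000000))
# -> ['Iran', 'Barack Obama']
```

Even though "Barack Obama" has the higher raw count, its broad document frequency pulls it below the currently hot, narrower "Iran".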

Is my reasoning sound?  Can you provide any guidance as to the best way to
implement this?

Thanks for your help,

Asif


On Tue, Jun 23, 2009 at 1:19 PM, Grant Ingersoll wrote:

>
> On Jun 23, 2009, at 3:58 AM, Asif Rahman wrote:
>
>  Hi again,
>>
>> I guess nobody has used facets in the way I described below before.  Do
>> any
>> of the experts have any ideas as to how to do this efficiently and
>> correctly?  Any thoughts would be greatly appreciated.
>>
>> Thanks,
>>
>> Asif
>>
>> On Wed, Jun 17, 2009 at 12:42 PM, Asif Rahman  wrote:
>>
>>  Hi all,
>>>
>>> We have an index of news articles that are tagged with news topics.
>>> Currently, we use solr facets to see which topics are popular for a given
>>> query or time period.  I'd like to apply the concept of IDF to the facet
>>> counts so as to penalize the topics that occur broadly through our index.
>>> I've begun to write custom facet component that applies the IDF to the
>>> facet
>>> counts, but I also wanted to check if anyone has experience using facets
>>> in
>>> this way.
>>>
>>
>
> I'm not sure I'm following.  Would you be faceting on one field, but using
> the DF from some other field?  Faceting is already a count of all the
> documents that contain the term on a given field for that search.  If I'm
> understanding, you would still do the typical faceting, but then rerank by
> the global DF values, right?
>
> Backing up, what is the problem you are seeing that you are trying to
> solve?
>
> I think you could do this, but you'd have to hook it in yourself.  By
> penalize, do you mean remove, or just have them in the sort?  Generally
> speaking, looking up the DF value can be expensive, especially if you do a
> lot of skipping around.  I don't know how pluggable the sort capabilities
> are for faceting, but that might be the place to start if you are just
> looking at the sorting options.
>
>
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>


-- 
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com


Re: Create incremental snapshot

2009-07-10 Thread Asif Rahman
Tushar:

Is it necessary to do the optimize on each iteration?  When you run an
optimize, the entire index is rewritten, so none of the new files are shared
with earlier snapshots, and each snapshot ends up consuming the full index
size on your disk.

Asif
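The hard-link mechanics behind this can be demonstrated with plain files (a toy stand-in for Lucene segment files, not Solr's actual snapshot scripts):

```shell
set -e
dir=$(mktemp -d) && cd "$dir"
mkdir snapshot

# "Index" file, plus a snapshot entry that is a hard link to it.
echo "segment data v1" > segment_1
ln segment_1 snapshot/segment_1

# An optimize rewrites the index: a new file appears, the old one is deleted...
echo "segment data v2" > segment_2
rm segment_1

# ...but the snapshot's hard link still pins the old bytes on disk, so after
# a full rewrite each snapshot holds its own complete copy of the index.
cat snapshot/segment_1   # -> segment data v1
```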

On Thu, Jul 9, 2009 at 3:26 AM, tushar kapoor <
tushar_kapoor...@rediffmail.com> wrote:

>
> What I gather from this discussion is -
>
> 1. Snapshots are always hard links and not actual files so they cannot
> possibly consume the same amountof space.
> 2. Snapshots contain hard links to existing docs + delta docs.
>
> We are facing a situation wherein the snapshot occupies the same space as
> the actual indexes thus violating the first point.
> We have a batch processing scheme for refreshing indexes. the steps we
> follow are -
>
> 1. Delete 200 documents in one go.
> 2. Do an optimize.
> 3. Create the 200 documents deleted earlier.
> 4. Do a commit.
>
> This process continues for around 160,000 documents i.e. 800 times and by
> the end of it we have 800 snapshots.
>
> The size of actual indexes is 200 Mb and remarkably all the 800 snapshots
> are of size around 200 Mb each. In effect this process consumes around 160
> Gb space on our disks. This is causing a lot of pain right now.
>
> My concern are - Is our understanding of the snapshooter correct ? Should
> this massive space consumption be happening at all ? Are we missing
> something critical ?
>
> Regards,
> Tushar.
>
> Shalin Shekhar Mangar wrote:
> >
> > On Sat, Apr 18, 2009 at 1:06 PM, Koushik Mitra
> > wrote:
> >
> >> Ok
> >>
> >> If these are hard links, then where does the index data get stored?
> Those
> >> must be getting stored somewhere in the file system.
> >>
> >
> > Yes, of course they are stored on disk. The hard links are created from
> > the
> > actual files inside the index directory. When those older files are
> > deleted
> > by Solr, they are still left on the disk if at least one hard link to
> that
> > file exists. If you are looking for how to clean old snapshots, you could
> > use the snapcleaner script.
> >
> > Is that what you wanted to do?
> >
> > --
> > Regards,
> > Shalin Shekhar Mangar.
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Create-incremental-snapshot-tp23109877p24405434.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>


-- 
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com


Solr 1.5 in production

2010-02-19 Thread Asif Rahman
What is the prevailing opinion on using Solr 1.5 in a production
environment?  I know that many people were using 1.4 in production for a
while before it became an official release.

Specifically I'm interested in using some of the new spatial features.

Thanks,

Asif

-- 
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com


Re: Solr 1.5 in production

2010-02-20 Thread Asif Rahman
One piece of functionality that I need is the ability to index a spatial
shape.  I've begun implementing this for Solr 1.4 using just the spatial
capabilities in Lucene, with a custom update processor and query parser.  At
this point I'm only supporting rectangles, and the shapes are being indexed
as sets of spatial tiles.  In Solr 1.5, I believe the correct implementation
would be as a field type with the new subfield capabilities.  Do you have
any thoughts about the approach I'm taking?
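A much-simplified sketch of the tiling idea (a toy grid decomposition, not Lucene's actual tier scheme): the rectangle is reduced to the set of grid cells it overlaps, and those cell ids are what would be indexed as terms, so that rectangles sharing a cell can match:

```python
import math

def covering_tiles(min_x, min_y, max_x, max_y, tile_size=1.0):
    """Return the grid cells (ix, iy) overlapped by a rectangle."""
    tiles = []
    ix = math.floor(min_x / tile_size)
    while ix * tile_size < max_x:
        iy = math.floor(min_y / tile_size)
        while iy * tile_size < max_y:
            tiles.append((ix, iy))
            iy += 1
        ix += 1
    return tiles

print(covering_tiles(0, 0, 2, 1))  # -> [(0, 0), (1, 0)]
```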

Asif

On Fri, Feb 19, 2010 at 5:10 PM, Grant Ingersoll wrote:

>
> On Feb 19, 2010, at 4:54 PM, Asif Rahman wrote:
>
> > What is the prevailing opinion on using solr 1.5 in a production
> > environment?  I know that many people were using 1.4 in production for a
> > while before it became an official release.
> >
> > Specifically I'm interested in using some of the new spatial features.
>
> These aren't fully baked yet (still need some spatial filtering
> capabilities which I'm getting close to done with, or close enough to submit
> a patch anyway), but feedback would be welcome.  The main risk, I suppose,
> is that any new APIs could change.  Other than that, the usual advice
> applies:  Test it out in your environment and see if it meets your needs.
>  On the spatial stuff, we'd definitely appreciate feedback on performance,
> functionality, APIs, etc.
>
> -Grant




-- 
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com


Re: Solr 1.5 in production

2010-02-22 Thread Asif Rahman
We're modeling hyperlocal news articles.  Each article is indexed with a
shape that corresponds to the region of the map that is covered by the
source of the article.  We considered modeling the locality of the articles
as points, but that approach would have limited our search options to
bounding box filters and slow distance calculations.  We also felt that the
shape approach more closely resembled the true nature of the data.  For now,
we're only using rectangles, but this approach will give us the ability to
index amorphous shapes as well.  Are there any other techniques for indexing
shape data?
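
For illustration, the bounding-box-prefilter-plus-distance pattern mentioned above can be sketched as follows (spherical-earth haversine; function names are made up):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (spherical earth, R = 6371 km)."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def in_bbox(lat, lon, min_lat, min_lon, max_lat, max_lon):
    """Cheap rectangular prefilter; only survivors pay for the
    exact (slower) trigonometric distance calculation."""
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon
```

The point-based approach limits you to exactly this pipeline, whereas indexing the source region as a shape lets the match itself express coverage.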


On Mon, Feb 22, 2010 at 4:29 PM, Grant Ingersoll wrote:

>
> On Feb 20, 2010, at 8:53 AM, Asif Rahman wrote:
>
> > One piece of functionality that I need is the ability to index a spatial
> > shape.  I've begun implementing this for solr 1.4 using just the spatial
> > capabilities in lucene with a custom update processor and query parser.
>  At
> > this point I'm only supporting rectangles and the shapes are being
> indexed
> > as sets of spatial tiles.  In solr 1.5, I believe the correct
> implementation
> > would be as a field type with the new subfielding capabilities.  Do you
> have
> > any thoughts about the approach I'm taking?
>
> Yeah, I think in 1.5 you would use the subfield approach.  How do you plan
> on using the shape in search?
>
> >
> > Asif
> >
>
> --
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem using Solr/Lucene:
> http://www.lucidimagination.com/search
>
>


-- 
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com


Stemming Filters in wiki

2010-05-19 Thread Asif Rahman
I see that the entries for PorterStemFilterFactory,
EnglishPorterFilterFactory, and SnowballPorterFilterFactory have been
removed from the Analyzers, Tokenizers, and Token Filters wiki page.  Is
there a reason for this?

Thanks,

asif


-- 
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com


Index-time vs. search-time boosting performance

2010-06-04 Thread Asif Rahman
Hi,

What are the performance ramifications for using a function-based boost at
search time (through bf in dismax parser) versus an index-time boost?
Currently I'm using boost functions on a 15GB index of ~14mm documents.  Our
queries generally match many thousands of documents.  I'm wondering if I
would see a performance improvement by switching over to index-time
boosting.

Thanks,

Asif

-- 
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com


Re: Index-time vs. search-time boosting performance

2010-06-04 Thread Asif Rahman
Perhaps I should have been more specific in my initial post.  I'm doing
date-based boosting on the documents in my index, so as to assign a higher
score to more recent documents.  Currently I'm using a boost function to
achieve this.  I'm wondering if there would be a performance improvement if
instead of using the boost function at search time, I indexed the documents
with a date-based boost.
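
A search-time recency boost of this kind is often a reciprocal of document age; a minimal sketch (the constant mirrors the recip() example in the Solr function-query docs, not necessarily the poster's actual function):

```python
def recency_boost(doc_ts: float, now: float, m: float = 3.16e-11) -> float:
    """recip-style boost: 1 / (m * age_ms + 1).

    With m = 3.16e-11 the boost is ~1.0 for a brand-new document and
    falls to ~0.5 for a document one year old (1 year is roughly
    3.16e10 ms). Future-dated documents are clamped to age 0.
    """
    age_ms = max(0.0, (now - doc_ts) * 1000.0)
    return 1.0 / (m * age_ms + 1.0)
```

This is the kind of per-document arithmetic that runs for every matching document at query time, which is what the index-time alternative would precompute.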

On Fri, Jun 4, 2010 at 7:30 PM, Erick Erickson wrote:

> Index time boosting is different than search time boosting, so
> asking about performance is irrelevant.
>
> Paraphrasing Hossman from years ago on the Lucene list (from
> memory).
>
> ...index time boosting is a way of saying this documents'
> title is more important than other documents' titles. Search
> time boosting is a way of saying "I care about documents
> whose titles contain this term more than other documents
> whose titles may match other parts of this query"
>
> HTH
> Erick
>
> On Fri, Jun 4, 2010 at 5:10 PM, Asif Rahman  wrote:
>
> > Hi,
> >
> > What are the performance ramifications for using a function-based boost
> at
> > search time (through bf in dismax parser) versus an index-time boost?
> > Currently I'm using boost functions on a 15GB index of ~14mm documents.
> >  Our
> > queries generally match many thousands of documents.  I'm wondering if I
> > would see a performance improvement by switching over to index-time
> > boosting.
> >
> > Thanks,
> >
> > Asif
> >
> > --
> > Asif Rahman
> > Lead Engineer - NewsCred
> > a...@newscred.com
> > http://platform.newscred.com
> >
>



-- 
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com


Re: Index-time vs. search-time boosting performance

2010-06-04 Thread Asif Rahman
It seems like it would be far more efficient to calculate the boost factor
once and store it rather than calculating it for each request in real-time.
Some of our queries match tens of thousands if not hundreds of thousands of
documents in a 15GB index.  However, I'm not well-versed in lucene internals
so I may be misunderstanding what is going on here.


On Fri, Jun 4, 2010 at 8:31 PM, Jay Hill  wrote:

> I've done a lot of recency boosting to documents, and I'm wondering why you
> would want to do that at index time. If you are continuously indexing new
> documents, what was "recent" when it was indexed becomes, over time "less
> recent". Are you unsatisfied with your current performance with the boost
> function? Query-time recency boosting is a fairly common thing to do, and,
> if done correctly, shouldn't be a performance concern.
>
> -Jay
> http://lucidimagination.com
>
>
> On Fri, Jun 4, 2010 at 4:50 PM, Asif Rahman  wrote:
>
> > Perhaps I should have been more specific in my initial post.  I'm doing
> > date-based boosting on the documents in my index, so as to assign a
> higher
> > score to more recent documents.  Currently I'm using a boost function to
> > achieve this.  I'm wondering if there would be a performance improvement
> if
> > instead of using the boost function at search time, I indexed the
> documents
> > with a date-based boost.
> >
>



-- 
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com


Re: Index-time vs. search-time boosting performance

2010-06-05 Thread Asif Rahman
I know how to index a document with a boost but am still not sure whether
I'll see a search performance improvement with it.  The initial decision to
use a boost function at search-time was made to preserve the flexibility to
tweak the function without having to do a full reindex.  I no longer need that
flexibility, so I was wondering if I would get better performance by
implementing the boost at index-time.


On Fri, Jun 4, 2010 at 11:48 PM, Jonathan Rochkind  wrote:

> The SolrRelevancyFAQ does suggest that both index-time and search-time
> boosting can be used to boost the score of newer documents, but doesn't
> suggest what reasons/contexts one might choose one vs the other.  It only
> provides an example of search-time boost though, so it doesn't answer the
> question of how to do an index time boost, if that was a question.
>
>
> http://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_boost_the_score_of_newer_documents
>
> Sorry, this doesn't answer your question, but does contribute the fact that
> some author of the FAQ at some point considered index-time boost not
> neccesarily unreasonable.
> 
> From: Asif Rahman [a...@newscred.com]
> Sent: Friday, June 04, 2010 11:31 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Index-time vs. search-time boosting performance
>
> It seems like it would be far more efficient to calculate the boost factor
> once and store it rather than calculating it for each request in real-time.
> Some of our queries match tens of thousands if not hundreds of thousands of
> documents in a 15GB index.  However, I'm not well-versed in lucene
> internals
> so I may be misunderstanding what is going on here.
>
>
>
>
>
> --
> Asif Rahman
> Lead Engineer - NewsCred
> a...@newscred.com
> http://platform.newscred.com
>



-- 
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com


Re: Index-time vs. search-time boosting performance

2010-06-05 Thread Asif Rahman
Thanks everyone for your help so far.  I'm still trying to get to the bottom
of whether switching over to index-time boosts will give me a performance
improvement, and if so if it will be noticeable.  This is all under the
assumption that I can achieve the scoring functionality that I need with
either index-time or search-time boosting (given the loss of precision).  I
can always dust off the old profiler to see what's going on with the
search-time boosts, but testing the index-time boosts will require a full
reindex, which could take days with our dataset.
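
The precision loss can be demonstrated with a toy 8-bit round trip (a simplified stand-in for Lucene's single-byte norm encoding, not the real SmallFloat format):

```python
def quantize_boost(b: float) -> float:
    """Round-trip a boost through 256 evenly spaced levels in [0, 2].

    Lucene's real encoding uses a 3-bit-mantissa float, but the effect
    is the same: nearby boosts collapse to one stored value.
    """
    levels = 256
    step = 2.0 / (levels - 1)
    code = max(0, min(levels - 1, round(b / step)))
    return code * step

# boosts for documents indexed an hour apart become indistinguishable
b1 = quantize_boost(1.0000)
b2 = quantize_boost(1.0001)
```

So even before profiling, an hourly-resolution date boost may simply not survive the index-time encoding.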

On Sat, Jun 5, 2010 at 9:17 AM, Robert Muir  wrote:

> On Fri, Jun 4, 2010 at 7:50 PM, Asif Rahman  wrote:
>
> > Perhaps I should have been more specific in my initial post.  I'm doing
> > date-based boosting on the documents in my index, so as to assign a
> higher
> > score to more recent documents.  Currently I'm using a boost function to
> > achieve this.  I'm wondering if there would be a performance improvement
> if
> > instead of using the boost function at search time, I indexed the
> documents
> > with a date-based boost.
> >
> >
> Asif, without knowing more details, before you look at performance you
> might
> want to consider the relevance impacts of switching to index-time boosting
> for your use case too.
>
> You can read more about the differences here:
> http://lucene.apache.org/java/3_0_1/scoring.html
>
> But I think the most important for this date-influenced use case is:
>
> "Indexing time boosts are preprocessed for storage efficiency and written
> to
> the directory (when writing the document) in a single byte (!)"
>
> If you do this as an index-time boost, your boosts will lose lots of
> precision for this reason.
>
> --
> Robert Muir
> rcm...@gmail.com
>



-- 
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com


Re: Index-time vs. search-time boosting performance

2010-06-07 Thread Asif Rahman
I still need a relatively precise boost.  No less precise than hourly.  I
think that would make for a pretty messy field query.
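
For illustration, an hourly version of the range-bucket approach might be generated like this (field name and boost values are made up); it shows how the clause grows by one sub-range per hour of history covered:

```python
from datetime import datetime, timedelta

def hourly_boost_clause(now: datetime, hours: int = 3) -> str:
    """Build one boosted date-range clause per hourly bucket.

    A sketch of the range-bucket idea at hourly resolution; a real
    deployment covering days of history would emit dozens of clauses,
    which is the "messy" part.
    """
    parts = []
    for i in range(hours):
        hi = now - timedelta(hours=i)
        lo = hi - timedelta(hours=1)
        boost = hours - i  # newer buckets get larger boosts
        parts.append("published_at:[%s TO %s]^%d" % (
            lo.strftime("%Y-%m-%dT%H:%M:%SZ"),
            hi.strftime("%Y-%m-%dT%H:%M:%SZ"),
            boost))
    return " OR ".join(parts)

clause = hourly_boost_clause(datetime(2010, 6, 7, 12, 0, 0))
```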


On Mon, Jun 7, 2010 at 2:15 AM, Lance Norskog  wrote:

> If you are unhappy with the performance overhead of a function boost,
> you can push it into a field query by boosting date ranges.
>
> You would group in date ranges: documents in September would be
> boosted 1.0, October 2.0, November 3.0 etc.
>
>
> On 6/5/10, Asif Rahman  wrote:
> > Thanks everyone for your help so far.  I'm still trying to get to the
> bottom
> > of whether switching over to index-time boosts will give me a performance
> > improvement, and if so if it will be noticeable.  This is all under the
> > assumption that I can achieve the scoring functionality that I need with
> > either index-time or search-time boosting (given the loss of precision.
>  I
> > can always dust off the old profiler to see what's going on with the
> > search-time boosts, but testing the index-time boosts will require a full
> > reindex, which could take days with our dataset.
> >
> >
> >
> >
> > --
> > Asif Rahman
> > Lead Engineer - NewsCred
> > a...@newscred.com
> > http://platform.newscred.com
> >
>
>
> --
> Lance Norskog
> goks...@gmail.com
>



-- 
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com


Re: how to test solr's performance?

2010-06-10 Thread Asif Rahman
I'm currently doing a trial with NewRelic's RPM Gold product.  I'm having a
great experience so far.  Among other things, it gives stats on response
time and throughput, broken down by request URL.  It also gives stats on
heap size and GC.  Using this tool, we were able to determine which calls
were using the most resources and implemented application level caching to
ease the load on our Solr server.  It also helped us diagnose and tune some
garbage collection issues we were having.
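
A minimal sketch of application-level caching in front of a search call (in-process memoization; the search function here is a stand-in, not an actual Solr client):

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_search(query: str) -> str:
    """Stand-in for an expensive Solr round trip; repeated identical
    queries are answered from the in-process cache instead of
    hitting the Solr server again."""
    return "results-for:" + query  # imagine an HTTP request here

cached_search("solr")
cached_search("solr")  # second identical call is served from the cache
hits = cached_search.cache_info().hits
```

Request-level stats like those from the monitoring tool make it easy to pick which calls are worth fronting with a cache like this.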


On Thu, Jun 10, 2010 at 4:53 AM, Marc Sturlese wrote:

>
> I normally use jmeter, jconsole and iostat. Recently
> http://www.newrelic.com/solr.html has been released
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/how-to-test-solr-s-performance-tp881928p885025.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>



-- 
Asif Rahman
Lead Engineer - NewsCred
a...@newscred.com
http://platform.newscred.com