SOLR interface with PHP using javabin?

2010-09-16 Thread onlinespend...@gmail.com
 I am planning on creating a website that has some SOLR search 
capabilities for the users, and was also planning on using PHP for the 
server-side scripting.


My goal is to find the most efficient way to submit search queries from 
the website, interface with SOLR, and display the results back on the 
website.  If I use PHP, it seems that all the solutions use some form of 
character based stream for the interface.  It would seem that using a 
binary representation, such as javabin, would be more efficient.


If using javabin, or some similar efficient binary stream to interface 
SOLR with PHP is not possible, what do people recommend as the most 
efficient solution that provides the best performance, even if that 
means not using PHP and going with some other alternative?


Thank you,
Ben


Re: SOLR interface with PHP using javabin?

2010-09-16 Thread onlinespend...@gmail.com
 OK, thanks for the suggestion.  Why do you recommend using JSON over 
simply using the built-in PHPSerializedResponseWriter?


I find using an interface that requires the data to be parsed to be 
inefficient (this would include the aforementioned 
PHPSerializedResponseWriter as well).  Wouldn't it be far better to use 
some standard data structure that is sent as a bit stream?


Ben

On 9/16/2010 11:38 AM, Thomas Joiner wrote:

If you wish to interface to Solr from PHP, and decide to go with Yonik's
suggestion to use JSON, I would suggest using
http://code.google.com/p/solr-php-client/

It has served my needs for the most part.

On Thu, Sep 16, 2010 at 1:33 PM, Yonik Seeley wrote:


On Thu, Sep 16, 2010 at 2:30 PM, onlinespend...@gmail.com wrote:

I am planning on creating a website that has some SOLR search capabilities
for the users, and was also planning on using PHP for the server-side
scripting.

My goal is to find the most efficient way to submit search queries from the
website, interface with SOLR, and display the results back on the website.
If I use PHP, it seems that all the solutions use some form of character
based stream for the interface.  It would seem that using a binary
representation, such as javabin, would be more efficient.

If using javabin, or some similar efficient binary stream to interface SOLR
with PHP is not possible, what do people recommend as the most efficient
solution that provides the best performance, even if that means not using
PHP and going with some other alternative?

I'd recommend going with JSON - it will be quite a bit smaller than
XML, and the parsers are generally quite efficient.

-Yonik
http://lucenerevolution.org  Lucene/Solr Conference, Boston Oct 7-8
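
To make the JSON suggestion concrete, here is a minimal sketch of querying
Solr from PHP with wt=json and json_decode(). The host, port, field names,
and example query are placeholders, not from this thread; a real deployment
would add error handling and escape user input.

    <?php
    // Minimal sketch: send a query to Solr, ask for JSON, decode into a PHP array.
    // Host, port, "title"/"id" fields, and the query are placeholders.
    $params = http_build_query(array(
        'q'    => 'title:solr',
        'wt'   => 'json',
        'rows' => 10,
    ));
    $raw = file_get_contents('http://localhost:8983/solr/select?' . $params);
    $response = json_decode($raw, true);

    echo "found ", $response['response']['numFound'], " documents\n";
    foreach ($response['response']['docs'] as $doc) {
        echo $doc['id'], "\n";
    }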



Solr Distributed Search "start parameter" limitation

2011-02-01 Thread onlinespend...@gmail.com
If you look at the Solr wiki, one of the limitations it mentions for
distributed searching concerns the start parameter.

http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations

"Makes it more inefficient to use a high "start" parameter. For example, if
you request start=500000&rows=25 on an index with 500,000+ docs per shard,
this will currently result in 500,000 records getting sent over the network
from the shard to the coordinating Solr instance. If you had a single-shard
index, in contrast, only 25 records would ever get sent over the network."

While I may not use a start parameter of 500,000, I could easily have one of
50,000, and I'm concerned about the performance hit I may take when using
such a high start parameter with distributed searching. I would use this if
the user issued a search query that resulted in, say, 50,000+ matches. I may
only display 40 matches per web page, with the user having the ability to
"jump" to the end of the results. So specifying a high start parameter is
certainly likely, and I know this sort of scenario is common for a lot of
websites. Are there tricks that can be played to avoid the performance hit
associated with specifying a high start parameter when doing distributed
searching?

Thanks,
Ben
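
The cost is easy to quantify: with start=50000&rows=40 over, say, four
shards, each shard must return its top 50,040 (id, score) entries, so the
coordinator merges roughly 200,000 of them just to produce 40 documents.
One trick that helps with the specific "jump to the end" case is to flip
the sort direction and page from the other end, so the offset stays small.
A rough PHP sketch; the sort field, shard list, and URLs are assumptions,
not something recommended in this thread:

    <?php
    // Sketch: serve a "jump to the end" request without a huge start value
    // by reversing the sort and paging from the other end.
    $rows = 40;
    $userQuery = isset($_GET['q']) ? $_GET['q'] : '*:*';
    $base = 'http://localhost:8983/solr/select?';
    $common = array(
        'q'      => $userQuery,
        'wt'     => 'json',
        'shards' => 'shard1:8983/solr,shard2:8983/solr',
    );

    // 1. Cheap query just to learn how many documents match.
    $count = json_decode(file_get_contents(
        $base . http_build_query($common + array('rows' => 0))), true);
    $numFound = $count['response']['numFound'];

    // 2. Fetch the "last" page with a small offset by sorting ascending
    //    (assumes results are normally shown sorted by a field, here a
    //    hypothetical "timestamp desc").
    $lastPageSize = $numFound % $rows ? $numFound % $rows : $rows;
    $resp = json_decode(file_get_contents(
        $base . http_build_query($common + array(
            'sort'  => 'timestamp asc',   // reversed sort direction
            'start' => 0,
            'rows'  => $lastPageSize,
        ))), true);

    // 3. Reverse the docs so they display in the original order.
    $docs = array_reverse($resp['response']['docs']);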


access document by primary key

2011-03-11 Thread onlinespend...@gmail.com
what's the quickest and most efficient way to access a doc by its primary
key? suppose I already know a document's unique id and simply want to fetch
it without issuing a sophisticated query.

Thanks,
Ben
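
In the absence of a dedicated lookup handler, the usual approach is a plain
term query on the uniqueKey field with rows=1, which Solr can answer very
cheaply. A minimal PHP sketch, where the core URL and the field name "id"
are assumptions:

    <?php
    // Sketch: fetch a single document by its uniqueKey with a simple query.
    $id = 'doc-12345';
    $params = http_build_query(array(
        'q'    => 'id:"' . $id . '"',   // quote the value; escape it if it can
                                        // contain special query characters
        'rows' => 1,
        'wt'   => 'json',
    ));
    $resp = json_decode(file_get_contents(
        'http://localhost:8983/solr/select?' . $params), true);
    $doc = isset($resp['response']['docs'][0]) ? $resp['response']['docs'][0] : null;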


keeping data consistent between Database and Solr

2011-03-14 Thread onlinespend...@gmail.com
Like many people, I am not using Solr as my primary data store. Not all of my
data needs to be searchable, and for simple and fast retrieval I store it in a
database (Cassandra in my case).  Actually I don't have this all built up yet,
but my intention is that whenever new data is entered, it gets added to my
Cassandra database and simultaneously added to the Solr index (either by
queuing up recent data before a commit or by some other means; any suggestions
on this front?).

But my main question is, how do I guarantee that the data in my Cassandra
database and my Solr index stay consistent and up-to-date?  What if I write
the data to Cassandra and then a failure occurs during the commit to the Solr
index?  I would need to be aware of what data failed to commit and make sure
that a re-attempt is made.  Obviously inconsistency for a short duration is
inevitable when using two different stores (Cassandra and Solr), but I
certainly don't want a failure to create perpetual inconsistency.  I'm
curious what sort of mechanisms people are using to ensure consistency
between their database (MySQL, Cassandra, etc.) and Solr.

Thank you,
Ben
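
One pattern that addresses the "failure during the Solr add" case (an
assumption about how this could be built, not something from the thread):
record the document's key in a durable pending log before trying to index
it, and clear the entry only after Solr accepts the add. Because re-adding
the same uniqueKey simply overwrites the existing document, a periodic retry
of anything left in the log is safe. A rough PHP sketch, with the update
endpoint and the file-based log as placeholders:

    <?php
    // Sketch: record the key in a "pending" log, index into Solr, and clear
    // the entry only on success so a cron job can retry leftovers.
    // File path, endpoint, and field names are placeholders; commits are
    // assumed to be issued separately.

    define('PENDING_LOG', '/tmp/pending_solr_ids');

    function recordPending($id) {
        file_put_contents(PENDING_LOG, $id . "\n", FILE_APPEND);
    }

    function clearPending($id) {
        $ids = file_exists(PENDING_LOG)
             ? file(PENDING_LOG, FILE_IGNORE_NEW_LINES) : array();
        file_put_contents(PENDING_LOG, implode("\n", array_diff($ids, array($id))) . "\n");
    }

    function indexIntoSolr(array $doc) {
        // Build a minimal <add><doc>...</doc></add> update message.
        $xml = '<add><doc>';
        foreach ($doc as $name => $value) {
            $xml .= '<field name="' . htmlspecialchars($name) . '">'
                  . htmlspecialchars($value) . '</field>';
        }
        $xml .= '</doc></add>';

        $ch = curl_init('http://localhost:8983/solr/update');
        curl_setopt($ch, CURLOPT_POST, true);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $xml);
        curl_setopt($ch, CURLOPT_HTTPHEADER, array('Content-Type: text/xml'));
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_exec($ch);
        $ok = curl_getinfo($ch, CURLINFO_HTTP_CODE) == 200;
        curl_close($ch);
        return $ok;
    }

    // After the document has been written to Cassandra (not shown):
    function addDocument(array $doc) {
        recordPending($doc['id']);
        if (indexIntoSolr($doc)) {
            clearPending($doc['id']);
        }
        // anything still in PENDING_LOG gets retried by a periodic job
    }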


Re: keeping data consistent between Database and Solr

2011-03-15 Thread onlinespend...@gmail.com
Solandra is great for adding better scalability and NRT to Solr, but
it pretty much just stores the index in Cassandra and insulates that
from the user. It doesn't solve the problem of allowing quick and
direct retrieval of data that need not be searched. I could certainly
just use a Solr search query to "directly" access a single document,
but that has overhead and would not be as efficient as directly
accessing a database. With potentially tens of thousands of
simultaneous direct data accesses, I'd rather not put this burden on
Solr and would prefer to use it only for search, as it was intended,
while simple data retrieval could come from a better equipped
database.

But my question about consistency applies to any database paired with Solr.
I would imagine most people maintain separate MySQL and Solr databases.

On Tuesday, March 15, 2011, Bill Bell  wrote:
> Look at Solandra. Solr + Cassandra.
>
> On 3/14/11 9:38 PM, "onlinespend...@gmail.com" 
> wrote:
>
>>Like many people, Solr is not my primary data store. Not all of my data
>>need
>>be searchable and for simple and fast retrieval I store it in a database
>>(Cassandra in my case).  Actually I don't have this all built up yet, but
>>my
>>intention is that whenever new data is entered that it be added to my
>>Cassandra database and simultaneously added to the Solr index (either by
>>queuing up recent data before a commit or some other means; any
>>suggestions
>>on this front?).
>>
>>But my main question is, how do I guarantee that data between my Cassandra
>>database and Solr index are consistent and up-to-date?  What if I write
>>the
>>data to Cassandra and then a failure occurs during the commit to the Solr
>>index?  I would need to be aware what data failed to commit and make sure
>>that a re-attempt is made.  Obviously inconsistency for a short duration
>>is
>>inevitable when using two different databases (Cassandra and Solr), but I
>>certainly don't want a failure to create perpetual inconsistency.  I'm
>>curious what sort of mechanisms people are using to ensure consistency
>>between their database (MySQL, Cassandra, etc.) and Solr.
>>
>>Thank you,
>>Ben
>
>
>


Re: keeping data consistent between Database and Solr

2011-03-15 Thread onlinespend...@gmail.com
That's pretty interesting to use the autoincrementing document ID as a way
to keep track of what has not been indexed in Solr.  And you overwrite this
document ID even when you modify an existing document.  Very cool.  I
suppose the number can even rotate back to 0, as long as you handle that.

I am thinking of using a timestamp to achieve a similar thing. All documents
that have been accessed after the last Solr index need to be added to the
Solr index.  In fact, each name-value pair in Cassandra has a timestamp
associated with it, so I'm curious if I could simply use this.

I'm curious how you handle the delta-imports. Do you have some routine that
periodically checks for updates to your MySQL database via the document ID?
Which language do you use for that?

Thanks,
Ben

On Tue, Mar 15, 2011 at 9:12 AM, Shawn Heisey  wrote:

> On 3/14/2011 9:38 PM, onlinespend...@gmail.com wrote:
>
>> But my main question is, how do I guarantee that data between my Cassandra
>> database and Solr index are consistent and up-to-date?
>>
>
> Our MySQL database has two unique indexes.  One is a document ID,
> implemented in MySQL as an autoincrement integer and in Solr as a long.  The
> other is what we call a tag id, implemented in MySQL as a varchar and Solr
> as a single lowercased token and serving as Solr's uniqueKey.  We have an
> update trigger on the database that updates the document ID whenever the
> database document is updated.
>
> We have a homegrown build system for Solr.  In a nutshell, it keeps track
> of the newest document ID in the Solr Index.  If the DIH delta-import fails,
> it doesn't update the stored ID, which means that on the next run, it will
> try and index those documents again.  Changes to the entries in the database
> are automatically picked up because the document ID is newer, but the tag id
> doesn't change, so the document in Solr is overwritten.
>
> Things are actually more complex than I've written, because our index is
> distributed.  Hopefully it can give you some ideas for yours.
>
> Shawn
>
>
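
To make the flow Shawn describes explicit, here is a rough PHP translation
of the high-water-mark idea (his actual system is in Perl and more
involved). The state file, table and column names, and the custom DIH
request parameter are assumptions for illustration: remember the highest
document ID that made it into Solr, hand it to the delta-import, and only
move the marker forward after the import succeeds.

    <?php
    // Sketch: high-water-mark tracking for DIH delta-imports.
    // data-config.xml would reference the custom parameter in its deltaQuery
    // as ${dataimporter.request.last_did}.

    define('STATE_FILE', '/var/local/solr_last_did');

    $lastDid = file_exists(STATE_FILE)
             ? (int) trim(file_get_contents(STATE_FILE)) : 0;

    // Record the newest document ID *before* starting the import, so the
    // marker never skips past rows that arrive while the import is running.
    $db = new PDO('mysql:host=localhost;dbname=docs', 'user', 'pass');
    $newestDid = (int) $db->query('SELECT MAX(did) FROM documents')->fetchColumn();

    // Kick off the delta-import, passing the last indexed ID along.
    file_get_contents('http://localhost:8983/solr/dataimport?' . http_build_query(array(
        'command'  => 'delta-import',
        'last_did' => $lastDid,
    )));

    // ...then poll /dataimport until the run finishes (sketched at the end
    // of this thread); only if it succeeded:
    // file_put_contents(STATE_FILE, $newestDid);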


Re: keeping data consistent between Database and Solr

2011-03-21 Thread onlinespend...@gmail.com
On Mon, Mar 21, 2011 at 10:57 AM, Shawn Heisey  wrote:

> On 3/15/2011 12:54 PM, onlinespend...@gmail.com wrote:
>
>> That's pretty interesting to use the autoincrementing document ID as a way
>> to keep track of what has not been indexed in Solr.  And you overwrite
>> this
>> document ID even when you modify an existing document.  Very cool.  I
>> suppose the number can even rotate back to 0, as long as you handle that.
>>
>
> We use a bigint for the value, and the highest value is currently less than
> 300 million, so we don't expect it to ever rotate around to 0.  My build
> system would not be able to handle wraparound without manual intervention.
>  If we have that problem, I think we'd have to renumber the entire database
> and reindex.


One solution to reduce the rate at which this number grows would be to store
a "batch ID" rather than a "document ID". If you've just added batch #1428
to the Solr index, then any new or updated documents in your SQL database would
be assigned #1429. Since you already have a unique tag ID, you may be OK
with a non-unique ID for the sake of keeping track of index updates.


>
>
>  I am thinking of using a timestamp to achieve a similar thing. All
>> documents
>> that have been accessed after the last Solr index need to be added to the
>> Solr index.  In fact, each name-value pair in Cassandra has a timestamp
>> associated with it, so I'm curious if I could simply use this.
>>
>
> As long as you can guarantee that it's all deterministic and idempotent,
> you can use anything you like.  I hope you know what those words mean. :)
>  It's important when using timestamps that the system that runs the build
> script is the same one that stores the last-used timestamp.  That way you
> are guaranteed that you will never have things getting missed because of
> clock skew.


Yes, that is a concern of mine. If I go with a timestamp I'll certainly need
to pay close attention to things.


>
>
>  I'm curious how you handle the delta-imports. Do you have some routine
>> that
>> periodically checks for updates to your MySQL database via the document
>> ID?
>> Which language do you use for that?
>>
>
> The entire build system is written in Perl, where I am comfortable.  I even
> wrote an object-oriented module that the scripts share.  The update script
> runs every two minutes, from cron, indexing anything with a higher document
> ID than the one recorded during the last successful run.  There are some
> other scripts that run on longer intervals and handle things like deletes
> and data redistribution into shards.  These scripts kick off the build, then
> use the bare /dataimport URL to track when the import completes and whether
> it's successful.


> Thanks,
> Shawn
>

Thanks for the info. That's very helpful!

Ben
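
To round this out, a sketch of the "watch the bare /dataimport URL" step
Shawn mentions, again in PHP rather than Perl. The exact fields in the
status response vary between Solr versions, so the success check here is an
assumption to adapt rather than copy:

    <?php
    // Sketch: poll the DIH status page until the import is no longer busy,
    // then decide whether to advance the stored high-water mark.
    $statusUrl = 'http://localhost:8983/solr/dataimport?wt=json';

    do {
        sleep(5);   // don't hammer the handler
        $status = json_decode(file_get_contents($statusUrl), true);
    } while (isset($status['status']) && $status['status'] === 'busy');

    $messages = isset($status['statusMessages']) ? $status['statusMessages'] : array();
    $failed = isset($messages['Total Documents Failed'])
           && (int) $messages['Total Documents Failed'] > 0;

    if (!$failed) {
        // safe to persist the new marker recorded before the import started
        // file_put_contents(STATE_FILE, $newestDid);
        echo "delta-import finished cleanly\n";
    } else {
        echo "delta-import reported failures; keeping the old marker\n";
    }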