SOLR interface with PHP using javabin?
I am planning on creating a website that has some SOLR search capabilities for the users, and was also planning on using PHP for the server-side scripting. My goal is to find the most efficient way to submit search queries from the website, interface with SOLR, and display the results back on the website.

If I use PHP, it seems that all the solutions use some form of character-based stream for the interface. It would seem that using a binary representation, such as javabin, would be more efficient. If interfacing SOLR with PHP via javabin, or some similarly efficient binary stream, is not possible, what do people recommend as the most efficient solution that provides the best performance, even if that means not using PHP and going with some other alternative?

Thank you,
Ben
Re: SOLR interface with PHP using javabin?
OK, thanks for the suggestion. Why do you recommend using JSON over simply using the built-in PHPSerializedResponseWriter? Any interface that requires the data to be parsed seems inefficient to me (that would include the aforementioned PHPSerializedResponseWriter as well). Wouldn't it be far better to use some standard data structure that is sent as a bit stream?

Ben

On 9/16/2010 11:38 AM, Thomas Joiner wrote:
> If you wish to interface to Solr from PHP, and decide to go with Yonik's
> suggestion to use JSON, I would suggest using
> http://code.google.com/p/solr-php-client/
> It has served my needs for the most part.
>
> On Thu, Sep 16, 2010 at 1:33 PM, Yonik Seeley wrote:
>> On Thu, Sep 16, 2010 at 2:30 PM, onlinespend...@gmail.com wrote:
>>> [original question quoted in full; trimmed]
>>
>> I'd recommend going with JSON - it will be quite a bit smaller than XML,
>> and the parsers are generally quite efficient.
>>
>> -Yonik
>> http://lucenerevolution.org
>> Lucene/Solr Conference, Boston Oct 7-8
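P.S. For reference, here is how I would consume either writer from PHP; a minimal sketch, assuming a local core at http://localhost:8983/solr with the stock /select handler (hypothetical URL), error handling omitted. Either way the response still gets parsed on the PHP side, which is exactly my concern:

<?php
$q = urlencode('ipod');

// wt=json: parse with json_decode().
$json = file_get_contents("http://localhost:8983/solr/select?q={$q}&wt=json");
$data = json_decode($json, true);

// wt=phps: parse with unserialize() (the PHPSerializedResponseWriter).
$phps     = file_get_contents("http://localhost:8983/solr/select?q={$q}&wt=phps");
$dataPhps = unserialize($phps);

// Either writer ends up as an ordinary PHP array.
foreach ($data['response']['docs'] as $doc) {
    echo $doc['id'], "\n";
}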
Solr Distributed Search "start parameter" limitation
If you look at the Solr wiki, one of the limitations it mentions for distributed searching is with regards to the start parameter.

http://wiki.apache.org/solr/DistributedSearch#Distributed_Searching_Limitations

"Makes it more inefficient to use a high "start" parameter. For example, if you request start=500000&rows=25 on an index with 500,000+ docs per shard, this will currently result in 500,000 records getting sent over the network from the shard to the coordinating Solr instance. If you had a single-shard index, in contrast, only 25 records would ever get sent over the network."

While I may not have a start parameter of 500,000, I could easily have one of 50,000, and I'm concerned about the performance hit I may take when using such a high start parameter with distributed searching. I would use this if the user had issued a search query that resulted in, say, 50,000+ matches. I may only display 40 matches per web page, but the user has the ability to "jump" to the end of the results. So specifying a high start parameter is certainly likely, and I know this sort of scenario is common for a lot of websites.

Are there tricks that can be played to avoid the performance hit associated with specifying a high start parameter when doing distributed searching?

Thanks,
Ben
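P.S. To make the concern concrete, some rough arithmetic with a hypothetical shard count, following the wiki's description that each shard must send its top (start + rows) entries to the coordinator for merging:

<?php
$rows      = 40;    // results displayed per web page
$page      = 1250;  // user jumps deep into the result set
$numShards = 10;    // hypothetical

$start    = $rows * $page;          // start=50000
$perShard = $start + $rows;         // ~50,040 entries sent by each shard
$total    = $perShard * $numShards; // ~500,400 entries merged

printf("start=%d&rows=%d across %d shards: ~%d entries on the coordinator\n",
       $start, $rows, $numShards, $total);
printf("a single-shard index would send only %d records over the network\n",
       $rows);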
access document by primary key
What's the quickest and most efficient way to access a doc by its primary key? Suppose I already know a document's unique id and simply want to fetch it without issuing a sophisticated query.

Thanks,
Ben
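P.S. The obvious approach I know of is a plain term query on the uniqueKey field, something like the sketch below (assuming my key field is named "id" and a core at localhost:8983, both hypothetical). I'm wondering whether there is anything cheaper than going through the full query path.

<?php
$id  = 'doc-12345'; // hypothetical primary key value
$url = 'http://localhost:8983/solr/select'
     . '?q=' . urlencode('id:"' . $id . '"')
     . '&rows=1&wt=json';

$response = json_decode(file_get_contents($url), true);
$doc      = $response['response']['docs'][0] ?? null;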
keeping data consistent between Database and Solr
Like many people, I am not using Solr as my primary data store. Not all of my data need be searchable, and for simple and fast retrieval I store it in a database (Cassandra in my case). Actually I don't have this all built up yet, but my intention is that whenever new data is entered, it be added to my Cassandra database and simultaneously added to the Solr index (either by queuing up recent data before a commit or some other means; any suggestions on this front?).

But my main question is, how do I guarantee that data between my Cassandra database and Solr index are consistent and up-to-date? What if I write the data to Cassandra and then a failure occurs during the commit to the Solr index? I would need to be aware of what data failed to commit and make sure that a re-attempt is made. Obviously inconsistency for a short duration is inevitable when using two different databases (Cassandra and Solr), but I certainly don't want a failure to create perpetual inconsistency.

I'm curious what sort of mechanisms people are using to ensure consistency between their database (MySQL, Cassandra, etc.) and Solr.

Thank you,
Ben
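P.S. To make the failure window concrete, here is the kind of dual write I have in mind, sketched with placeholder helpers (all names and paths below are hypothetical):

<?php
function writeToCassandra(array $doc): void
{
    // placeholder: real code would use a Cassandra client library
}

function indexInSolr(array $doc): void
{
    // POST the document to Solr's JSON update handler (hypothetical
    // core and handler path); /update/json accepts an array of docs.
    $ctx = stream_context_create(['http' => [
        'method'  => 'POST',
        'header'  => "Content-Type: application/json\r\n",
        'content' => json_encode([$doc]),
    ]]);
    $result = file_get_contents(
        'http://localhost:8983/solr/update/json?commit=true', false, $ctx);
    if ($result === false) {
        throw new RuntimeException('Solr update failed');
    }
}

function enqueueForReindex(string $id): void
{
    // placeholder: append to a durable queue that a background job drains
    file_put_contents('/tmp/solr-retry.log', $id . "\n", FILE_APPEND);
}

function storeDocument(array $doc): void
{
    writeToCassandra($doc);            // primary store first

    try {
        indexInSolr($doc);             // then the search index
    } catch (RuntimeException $e) {
        // The database write succeeded but the index update failed:
        // record the id so a re-attempt is made later, rather than
        // letting the two stores drift apart permanently.
        enqueueForReindex($doc['id']);
    }
}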
Re: keeping data consistent between Database and Solr
Solandra is great for adding better scalability and NRT to Solr, but it pretty much just stores the index in Cassandra and insulates that from the user. It doesn't solve the problem of allowing quick and direct retrieval of data that need not be searched.

I could certainly just use a Solr search query to "directly" access a single document, but that has overhead and would not be as efficient as directly accessing a database. With potentially tens of thousands of simultaneous direct data accesses, I'd rather not put this burden on Solr and would prefer to use it only for search, as it was intended, while simple data retrieval could come from a better-equipped database.

But my question of consistency applies to all databases and Solr. I would imagine most people maintain separate MySQL and Solr databases.

On Tuesday, March 15, 2011, Bill Bell wrote:
> Look at Solandra. Solr + Cassandra.
>
> On 3/14/11 9:38 PM, "onlinespend...@gmail.com" wrote:
>> [original question quoted in full; trimmed]
Re: keeping data consistent between Database and Solr
That's pretty interesting: using the autoincrementing document ID as a way to keep track of what has not yet been indexed in Solr, and overwriting that document ID even when you modify an existing document. Very cool. I suppose the number can even rotate back to 0, as long as you handle that.

I am thinking of using a timestamp to achieve a similar thing: all documents that have been accessed after the last Solr index run need to be added to the index. In fact, each name-value pair in Cassandra has a timestamp associated with it, so I'm curious whether I could simply use that.

I'm also curious how you handle the delta-imports. Do you have some routine that periodically checks for updates to your MySQL database via the document ID? Which language do you use for that?

Thanks,
Ben

On Tue, Mar 15, 2011 at 9:12 AM, Shawn Heisey wrote:
> On 3/14/2011 9:38 PM, onlinespend...@gmail.com wrote:
>> But my main question is, how do I guarantee that data between my
>> Cassandra database and Solr index are consistent and up-to-date?
>
> Our MySQL database has two unique indexes. One is a document ID,
> implemented in MySQL as an autoincrement integer and in Solr as a long.
> The other is what we call a tag ID, implemented in MySQL as a varchar and
> in Solr as a single lowercased token, serving as Solr's uniqueKey. We have
> an update trigger on the database that updates the document ID whenever a
> database document is updated.
>
> We have a homegrown build system for Solr. In a nutshell, it keeps track
> of the newest document ID in the Solr index. If the DIH delta-import
> fails, it doesn't update the stored ID, which means that on the next run,
> it will try to index those documents again. Changes to entries in the
> database are automatically picked up because the document ID is newer, but
> the tag ID doesn't change, so the document in Solr is overwritten.
>
> Things are actually more complex than I've written, because our index is
> distributed. Hopefully it can give you some ideas for yours.
>
> Shawn
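P.S. If I understand the scheme, the bookkeeping would look something like the following. This is my own sketch in PHP, not your actual code; the table, column, file path, and request parameter names are all hypothetical.

<?php
// High-water-mark bookkeeping: the stored ID only advances after a
// successful import, so a failed run is automatically retried.

$stateFile   = '/var/lib/solr-build/last_doc_id';
$lastIndexed = (int) @file_get_contents($stateFile);

// Newest document ID currently in the database.
$pdo    = new PDO('mysql:host=localhost;dbname=content', 'user', 'pass');
$newest = (int) $pdo->query('SELECT MAX(doc_id) FROM docs')->fetchColumn();

if ($newest > $lastIndexed) {
    // Kick off a DIH delta-import; the DIH config would select rows
    // WHERE doc_id > ${dataimporter.request.lastIndexed} (assumption).
    $kick = file_get_contents(
        'http://localhost:8983/solr/dataimport'
        . '?command=delta-import&lastIndexed=' . $lastIndexed);

    if ($kick !== false) {
        // NOTE: the kick-off returns immediately. A real script would
        // poll the /dataimport status until idle and confirm success
        // before advancing the high-water mark; advancing only on
        // success is what makes failed runs retry themselves.
        file_put_contents($stateFile, (string) $newest);
    }
}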
Re: keeping data consistent between Database and Solr
On Mon, Mar 21, 2011 at 10:57 AM, Shawn Heisey wrote:
> On 3/15/2011 12:54 PM, onlinespend...@gmail.com wrote:
>> That's pretty interesting: using the autoincrementing document ID as a
>> way to keep track of what has not yet been indexed in Solr. [...] I
>> suppose the number can even rotate back to 0, as long as you handle that.
>
> We use a bigint for the value, and the highest value is currently less
> than 300 million, so we don't expect it to ever rotate around to 0. My
> build system would not be able to handle wraparound without manual
> intervention. If we have that problem, I think we'd have to renumber the
> entire database and reindex.

One solution to reduce the rate at which this number grows would be to store a "batch ID" rather than a "document ID". If you've just added batch #1428 to the Solr index, then any new or updated documents in your SQL database would be assigned #1429. Since you already have a unique tag ID, you may be OK with a non-unique ID for the sake of keeping track of index updates.

>> I am thinking of using a timestamp to achieve a similar thing. [...]
>
> As long as you can guarantee that it's all deterministic and idempotent,
> you can use anything you like. I hope you know what those words mean. :)
> It's important when using timestamps that the system that runs the build
> script is the same one that stores the last-used timestamp. That way you
> are guaranteed that you will never have things getting missed because of
> clock skew.

Yes, that is a concern of mine. If I go with a timestamp I'll certainly need to pay close attention to things.

>> I'm also curious how you handle the delta-imports. [...]
>
> The entire build system is written in Perl, where I am comfortable. I
> even wrote an object-oriented module that the scripts share. The update
> script runs every two minutes, from cron, indexing anything with a higher
> document ID than the one recorded during the last successful run. There
> are some other scripts that run on longer intervals and handle things like
> deletes and data redistribution into shards. These scripts kick off the
> build, then use the bare /dataimport URL to track when the import
> completes and whether it's successful.
>
> Thanks,
> Shawn

Thanks for the info. That's very helpful!

Ben
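P.S. For my own notes, a rough PHP equivalent of tracking the import via the bare /dataimport URL. The status field names and message wording below are my assumptions about the DIH response, not something confirmed in this thread.

<?php
$base = 'http://localhost:8983/solr/dataimport'; // hypothetical path

file_get_contents($base . '?command=delta-import'); // kick off the build

// Poll until the handler reports it is idle again.
do {
    sleep(5);
    $status = json_decode(file_get_contents($base . '?wt=json'), true);
} while (($status['status'] ?? '') === 'busy');

// DIH signals failure by rolling back; look for that in the final status
// message (assumed wording).
$final   = $status['statusMessages'][''] ?? '';
$success = (stripos($final, 'Rolled back') === false);

echo $success ? "import succeeded\n" : "import failed; will retry\n";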