Re: Is solr right for this scenario?

Eric Pugh Fri, 24 Apr 2009 06:00:07 -0700


On Apr 24, 2009, at 7:54 AM, Developer In London wrote:

Thanks for the fast reply. Wow this seems a very active community.

I have a few more questions in that case:

1) If Solr is going to be file-based, is it then preferable to run multiple
Solrs with shards? How can I determine what capacity 1 Solr can cope with?

It depends! Solr can manage up to X records easily in a single index; however, your mileage may vary. One of the nice things about Solr is that it is very scalable and offers you many options. I would go with the simplest setup for Solr for now, and then, as your development progresses and you load data, investigate sharding etc. Solr, properly managed, won't be your bottleneck; it'll be your data loading scripts or something elsewhere.
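
If a single index does eventually get too big, Solr's distributed search lets one query fan out over several shards through the `shards` request parameter. A minimal sketch of building such a query URL follows; the host names and the `text` field are placeholders, not anything from this thread.

```python
from urllib.parse import urlencode


def sharded_query_url(base_url, query, shards, rows=10):
    """Build a Solr select URL that fans the query out to several shards."""
    params = {
        "q": query,
        "rows": rows,
        # Comma-separated shard addresses, given without the http:// prefix.
        "shards": ",".join(shards),
    }
    return base_url.rstrip("/") + "/select?" + urlencode(params)


url = sharded_query_url(
    "http://localhost:8983/solr",
    'text:"Lloyds TSB"',
    ["localhost:8983/solr", "localhost:7574/solr"],  # hypothetical hosts
)
print(url)
```

Starting with one Solr and adding the `shards` parameter later means the client code barely changes when you split the index.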

2) I am presuming there are already tokenizers for hypertext and XML in Solr
so that it can extract the right information out?

There are a number of different options available out there for indexing content.

3) I also need to get the 'author' information out for things like blogs. I
guess there's no universal way of doing it and I have to have someone
manually go through the documents and feed the Solr index with the author
information?

Your loading script will be bespoke to your situation; however, any competent developer can put together scripts to load from your various data sources.
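
As one example of what such a script might do for blogs: many feeds already carry author information, so it need not be entered by hand. A rough sketch, assuming an RSS 2.0 feed that uses either the `dc:creator` or the plain `author` element; the sample feed and the function name `extract_authors` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Dublin Core creator element, commonly used by blog feeds.
DC_CREATOR = "{http://purl.org/dc/elements/1.1/}creator"

SAMPLE_FEED = """<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Example blog</title>
    <item>
      <link>http://example.com/post-1</link>
      <dc:creator>Jane Doe</dc:creator>
    </item>
    <item>
      <link>http://example.com/post-2</link>
      <author>john@example.com (John Roe)</author>
    </item>
    <item>
      <link>http://example.com/post-3</link>
    </item>
  </channel>
</rss>"""


def extract_authors(feed_xml):
    """Map each item link to its author, or None when the feed gives none."""
    root = ET.fromstring(feed_xml)
    authors = {}
    for item in root.iter("item"):
        link = item.findtext("link")
        # Prefer dc:creator, fall back to the plain RSS <author> element.
        authors[link] = item.findtext(DC_CREATOR) or item.findtext("author")
    return authors


authors = extract_authors(SAMPLE_FEED)
```

Items that come back as `None` are the ones that would still need a person, or a site-specific scraper, to fill in the author.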

When you mention 'write a loader script...', do you mean I should
incorporate the date checking in the loader script? Solr has no internal way
of checking the timestamp in a document and updating?

Solr makes no assumptions about your data sources; it isn't a document management system, it is just a search engine. Well, that isn't totally true: the new DataImportHandler architecture does allow you to preserve some information about "when did I last run an update, what has been updated since"; however, it's pretty new stuff.
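
To make the date check in a loader script concrete, here is a minimal sketch: compare each file's last-modified time with the time it was last indexed, and only build an update message for files that changed. The field names (`id`, `content`, `indexed_at`) and the `last_indexed` dictionary are assumptions; in practice the indexed time could come from a stored Solr field or a separate table. Posting the message to Solr's update handler is left out.

```python
import os
import tempfile
from datetime import datetime, timezone
from xml.sax.saxutils import escape


def needs_reindex(path, last_indexed):
    """True if the file is new or was modified after it was last indexed."""
    modified = datetime.fromtimestamp(os.path.getmtime(path), tz=timezone.utc)
    indexed_at = last_indexed.get(path)
    return indexed_at is None or modified > indexed_at


def build_add_xml(path, now):
    """Build a Solr XML <add> message for one file (field names are assumed)."""
    with open(path, encoding="utf-8") as handle:
        content = handle.read()
    stamp = now.strftime("%Y-%m-%dT%H:%M:%SZ")  # Solr date format, UTC
    return (
        "<add><doc>"
        f'<field name="id">{escape(path)}</field>'
        f'<field name="content">{escape(content)}</field>'
        f'<field name="indexed_at">{stamp}</field>'
        "</doc></add>"
    )


# Demo on a temporary file with fixed timestamps so the outcome is repeatable.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "post.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("Lloyds TSB & corporate banking")
    os.utime(path, (1_000_000_000, 1_000_000_000))

    last_indexed = {}
    first_pass = needs_reindex(path, last_indexed)  # never indexed before
    message = build_add_xml(
        path, datetime(2009, 4, 24, 13, 0, 7, tzinfo=timezone.utc)
    )
    last_indexed[path] = datetime.fromtimestamp(1_000_000_100, tz=timezone.utc)
    second_pass = needs_reindex(path, last_indexed)  # unchanged since indexing

    os.utime(path, (1_000_000_200, 1_000_000_200))  # simulate a later edit
    third_pass = needs_reindex(path, last_indexed)
```

The same comparison works for an RSS feed if you swap the file modification time for the feed's last published date.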



Eric

Thanks,

Nayeem

2009/4/24 Eric Pugh <ep...@opensourceconnections.com>
It seems like you have three components to your system:

1) Data indexing from multiple sources

2) Search for specific words in documents

3) Preserve rating and search term.
I think that Solr comes into play on #1 and #2. You can index content in any number of ways, either via the new DataImportHandler architecture, or the more traditional approach of writing a loader script that puts the documents in Solr. You can store in Solr when a document was indexed, and use that to check against the original documents to see if they changed: check a last published tag on an RSS feed, or the last updated time on a physical file.
This is a very common use case for Solr.
For #2, you could have users issue queries, and make them "favorites",
storing them in the DB.  Assuming they like the results, they mark the
documents with the ratings, which you could store in Solr, but I would put
in a DB.  Easier to manage: User A says 1, User B says 0.
Then for the UI, just issue the search based on queries stored in the db,
and match the ids up with the ranking in the DB.  Simple!
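
The DB side of this can be sketched as follows, here with SQLite for brevity. The table layout, the column names, and the hard-coded list of ids standing in for a Solr result are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE ratings (
           search_term TEXT NOT NULL,
           doc_id      TEXT NOT NULL,
           user_name   TEXT NOT NULL,
           rating      INTEGER NOT NULL CHECK (rating IN (-1, 0, 1)),
           PRIMARY KEY (search_term, doc_id, user_name)
       )"""
)
conn.executemany(
    "INSERT INTO ratings VALUES (?, ?, ?, ?)",
    [
        ("Lloyds TSB", "doc-1", "user_a", 1),
        ("Lloyds TSB", "doc-1", "user_b", 0),
        ("Lloyds TSB", "doc-2", "user_a", -1),
    ],
)


def attach_ratings(conn, search_term, doc_ids):
    """Pair each document id from a search result with its stored ratings."""
    by_doc = {doc_id: {} for doc_id in doc_ids}
    if not doc_ids:
        return by_doc
    placeholders = ",".join("?" for _ in doc_ids)
    rows = conn.execute(
        "SELECT doc_id, user_name, rating FROM ratings "
        f"WHERE search_term = ? AND doc_id IN ({placeholders})",
        [search_term, *doc_ids],
    ).fetchall()
    for doc_id, user_name, rating in rows:
        by_doc[doc_id][user_name] = rating
    return by_doc


# Ids as they might come back from a Solr query for the stored search term.
solr_ids = ["doc-1", "doc-2", "doc-3"]
rated = attach_ratings(conn, "Lloyds TSB", solr_ids)
```

Keying the table on search term, document id and user keeps "User A says 1, User B says 0" as two plain rows, which is awkward to model inside a single Solr document.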
As far as the last part, Solr works best on the filesystem; that is part of why
it is so fast, no clunky SQL.  There are scripts for backing up and
restoring indexes that you can use; check the wiki:
http://wiki.apache.org/solr/SolrOperationsTools.

Eric




On Apr 24, 2009, at 6:18 AM, Developer In London wrote:

Hi All,
I am new to the whole Solr/Lucene community, but I think this might be the
solution to what I am looking to do. I would appreciate any feedback on how
I can go about doing this with Solr:

I am looking to make a system where -
a) mainly lots of different blog sites, web journals, articles are indexed
on a regular basis. Data that has already been indexed needs to be revisited
to see if there are any changes.
b) The end users have very fixed search terms, e.g. 'Lloyds TSB' and
'Corporate Banking'. All the documents that are found matching these are
presented to a human to analyse.
c) Once the human analyses the document he gives it a rating of 1, 0 or -1.
This rating needs to be saved somewhere and be linked with the specific
document and also with the search term (e.g. 'Lloyds TSB' & 'Corporate
Banking' in this case).
d) End users can then see these documents with the ratings next to them.
What would be the best approach to this?

Should I set up a different database to save the rating and relevant
mappings, or is there any way to put it in to Solr?
My 2nd question is, can the Solr index be saved in a database in any way?
What's the backup and recovery method on Solr?

Thanks in advance.

Nayeem
-----------------------------------------------------
Eric Pugh | Principal | OpenSource Connections, LLC | 434.466.1467 |
http://www.opensourceconnections.com
Free/Busy: http://tinyurl.com/eric-cal
--
cashflowclublondon.co.uk

                     ("`-''-/").___..--''"`-._
                      `6_ 6  )   `-.  (     ).`-.__.`)
                      (_Y_.)'  ._   )  `._ `. ``-..-'
                    _..`--'_..-_/  /--'_.' ,'
                   (il),-''  (li),'  ((!.-'
.


