On 7 October 2014 14:08, Vishal Sharma <vish...@grazitti.com> wrote:
> Hi,
>
> I am trying to get some help on finding out if there is any best practice
> to index WordPress blogs in a Solr index. Can someone help with the
> architecture I should be setting up?
>
> Do I need to write separate scripts to crawl WordPress and then pump posts
> back to Solr using its API?
Is your goal WordPress indexing in general, or specifically indexing into Solr? I ask because there are services such as:
https://wordpress.org/plugins/swiftype-search/

Otherwise, the question is what level of access you have to the WordPress installation. You could index the feeds WordPress produces (there is an example in the Solr distribution for RSS parsing). Or you could pull the content directly from the database. Or, if real-time updates are not important, you could periodically do a WordPress export (to XML) and parse that. I would NOT parse the HTML and try to recreate the content from it.

As to the rest of the architecture, you need to know whether you are indexing just generic WordPress content or also extensions such as custom taxonomies, custom values, etc.

These are all important questions because they will drive the Solr architecture more than the original question you seem to be asking.

Regards,
   Alex.
Personal: http://www.outerthoughts.com/ and @arafalov
Solr resources and newsletter: http://www.solr-start.com/ and @solrstart
Solr popularizers community: https://www.linkedin.com/groups?gid=6713853
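
As a rough illustration of the feed-based option mentioned above, here is a minimal Python sketch. It is not taken from the thread and is not the RSS example shipped with Solr. It assumes a standard WordPress RSS 2.0 feed (full post body in <content:encoded>) and a Solr update handler that accepts a JSON array of documents. The field names (id, title, url, content, category) and the function names are hypothetical and would have to match your own schema.

```python
import json
import urllib.request
import xml.etree.ElementTree as ET

# Namespace WordPress feeds use for the full post body (<content:encoded>).
CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"


def feed_to_docs(rss_xml):
    """Turn RSS 2.0 feed text into a list of dicts, one per post.

    The dict keys are assumed Solr field names, not a standard schema.
    """
    root = ET.fromstring(rss_xml)
    docs = []
    for item in root.iter("item"):
        # Prefer the full body; fall back to the summary if it is absent.
        body = item.findtext(CONTENT_NS + "encoded")
        if body is None:
            body = item.findtext("description", default="")
        docs.append({
            "id": item.findtext("guid") or item.findtext("link"),
            "title": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            "content": body,
            "category": [c.text for c in item.findall("category") if c.text],
        })
    return docs


def post_to_solr(docs, update_url):
    """POST the documents as a JSON array to a Solr update handler."""
    req = urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Usage might look like the following, where both URLs are placeholders for your own site and core:

    xml = urllib.request.urlopen("http://example.com/feed/").read()
    post_to_solr(feed_to_docs(xml), "http://localhost:8983/solr/collection1/update?commit=true")

Note that the feed body still contains HTML markup, so the content field would probably need an analyzer that strips it. A feed also typically exposes only the most recent posts, so the database or XML export routes are likely a better fit for an initial full index.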