Re: Best Practices for open source pipeline/connectors

Jürgen Wagner (DVT) Tue, 04 Nov 2014 13:51:17 -0800

Hello Dan,
  ManifoldCF is a connector framework, not a processing framework.
Therefore, you may try your own lightweight connectors (which usually
are not really rocket science and may take less time to write than time
to configure a super-generic connector of some sort), any connector out
there (including Nutch and others), or even commercial offerings from
some companies. My guess, however, is that connectors alone won't make you
very happy. The key to really creating value out of data dragged into a search
platform is the processing pipeline. Depending on the scale of data and
the amount of processing you need to do, you may have a simplistic
approach with just some more or less configurable Java components
massaging your data until it can be sent to Solr (without using Tika or
any other processing in Solr), or you can employ frameworks like Apache
Spark to transform and enrich the data heavily before feeding it
into Solr.
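
To illustrate the simple end of that spectrum, here is a minimal sketch of a
chain of small processing stages that massage a document before it would be
handed to the indexer. It is written in Python for brevity rather than Java,
and every name in it (Document, normalize_whitespace, add_length,
run_pipeline) is made up for this example and not part of Solr, ManifoldCF or
any other library.

```python
from functools import reduce
from typing import Callable, Dict, Iterable, List

# Hypothetical document model: a flat mapping of field name to value.
Document = Dict[str, object]
Stage = Callable[[Document], Document]


def normalize_whitespace(doc: Document) -> Document:
    """Collapse runs of whitespace in the 'body' field."""
    body = str(doc.get("body", ""))
    return {**doc, "body": " ".join(body.split())}


def add_length(doc: Document) -> Document:
    """Enrich the document with the length of its cleaned body."""
    return {**doc, "body_length": len(str(doc.get("body", "")))}


def run_pipeline(doc: Document, stages: Iterable[Stage]) -> Document:
    """Apply each stage in order; every stage returns a new document."""
    return reduce(lambda current, stage: stage(current), stages, doc)


# The stage list is plain data, so it could just as well come from config.
STAGES: List[Stage] = [normalize_whitespace, add_length]

processed = run_pipeline({"id": "doc-1", "body": "  Hello \n  Solr  "}, STAGES)
```

Only the finished document would then be sent to Solr. Whether a chain like
this is enough, or whether something like Spark is needed, depends on the
data volume and how heavy the processing is.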

I prefer to have a clear separation between connectors, processing,
indexing/querying and front-end visualization/interaction. Only the
indexing/querying task I grant to Solr (or naked Lucene or
Elasticsearch). Each of the different task types has entirely different
scaling requirements and computing/networking properties, so you
definitely don't want them to depend on each other too much. When serving
several customers, you may even need to swap one component or another for
what a particular customer prefers or needs.
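
One way to picture that separation is a pair of narrow interfaces with the
processing step in between. The following is only a sketch under my own
assumptions: Connector, Indexer, ListConnector and InMemoryIndexer are
hypothetical names, and the in-memory indexer merely stands in for Solr,
Lucene or Elasticsearch so that the example runs without a server.

```python
from typing import Dict, Iterable, Iterator, List, Protocol

Document = Dict[str, str]


class Connector(Protocol):
    """Anything that can pull raw documents from a source."""

    def fetch(self) -> Iterator[Document]: ...


class Indexer(Protocol):
    """Anything that can accept processed documents for indexing."""

    def index(self, docs: Iterable[Document]) -> None: ...


class ListConnector:
    """Stand-in connector that serves documents from a list."""

    def __init__(self, docs: List[Document]) -> None:
        self._docs = docs

    def fetch(self) -> Iterator[Document]:
        return iter(self._docs)


class InMemoryIndexer:
    """Stand-in for a real search back end; it only collects documents."""

    def __init__(self) -> None:
        self.indexed: List[Document] = []

    def index(self, docs: Iterable[Document]) -> None:
        self.indexed.extend(docs)


def process(docs: Iterable[Document]) -> Iterator[Document]:
    """Processing sits between connector and indexer and knows neither."""
    for doc in docs:
        yield {**doc, "title": doc["title"].strip().lower()}


def run(connector: Connector, indexer: Indexer) -> None:
    """Wire the three parts together: connector -> processing -> indexer."""
    indexer.index(process(connector.fetch()))


indexer = InMemoryIndexer()
run(ListConnector([{"id": "1", "title": "  Annual Report "}]), indexer)
```

Because run() depends only on the two interfaces, a different connector or a
different back end can be swapped in without touching the processing code.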

So, my answer is YES. But we've also tried Nutch, our own specialized
crawlers and a number of elaborate connectors for special customer
applications. In any case, the output of the connector doesn't go directly
into Solr. It goes into processing first, and from there into Solr. I
suspect that connectors won't be the challenge in your project. Solr
requires a bit of tuning and tweaking, but you'll be fine eventually.
Document processing will be the fun part. Once you start scaling the zoo
of components, this will become evident :-)

What is the volume and influx rate in your scenario?

Best regards,
--Jürgen

On 04.11.2014 22:01, Dan Davis wrote:
> I'm trying to do research for my organization on the best practices for
> open source pipeline/connectors. Since we need web crawls, file system
> crawls, and databases, it seems to me that ManifoldCF might be the best
> choice.
>
> Has anyone combined ManifoldCF with Solr UpdateRequestProcessors or
> DataImportHandler? It would be nice to decide in ManifoldCF which
> resultHandler should receive a document or id. Barring that, you could post
> some fields including a URL and have DataImportHandler handle it. It
> already supports scripts, whereas ManifoldCF may not at this time.
>
> Suggestions and ideas?
>
> Thanks,
>
> Dan
>

-- 

Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С
уважением
*i.A. Jürgen Wagner*
Head of Competence Center "Intelligence"
& Senior Cloud Consultant

Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wag...@devoteam.com
<mailto:juergen.wag...@devoteam.com>, URL: www.devoteam.de
<http://www.devoteam.de/>

------------------------------------------------------------------------
Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071
