The disadvantages of DIH are: 1> it's a black box, so debugging it isn't easy; 2> it puts all the work on the Solr node. Parsing documents in various formats can be pretty heavyweight and steal cycles from indexing and searching. 2a> the extracting request handler also puts all the load on Solr, FWIW.
Personally I prefer an external program (and I was gratified to see Yavar's
reference to the indexing-with-SolrJ article). But then I'm a Java
programmer by training, so that seems easy to me.

Best,
Erick

On Tue, Apr 7, 2015 at 7:41 AM, Dan Davis <dansm...@gmail.com> wrote:
> Sangeetha,
>
> You can also run Tika directly from the Data Import Handler, and the Data
> Import Handler can be made to run several threads if you can partition the
> input documents by directory or database id. I've done 4 "threads" by
> having a base configuration that does an Oracle query like this:
>
> SELECT * FROM (SELECT id, url, ..., MOD(ROWNUM, 4) AS threadid FROM ...
> WHERE ...) WHERE threadid = %d
>
> A bash/sed script writes several data import handler XML files, and I can
> then index several threads at a time.
>
> Each of these threads can then use all the transformers, e.g.
> TemplateTransformer, and XML can be transformed via XSLT.
>
> The Data Import Handler also has entity processors that go out to the web
> and then index the document via Tika.
>
> If you are indexing generic HTML, you may want to figure out an approach
> to SOLR-3808 and SOLR-2250. These can be resolved by recompiling Solr and
> Tika locally, because Boilerpipe has a bug that has been fixed but not
> pushed to Maven Central. Without that, the ASF cannot include the fix, but
> distributions such as LucidWorks Solr Enterprise can.
>
> I can drop some configs into github.com if I clean them up to obfuscate
> host names, passwords, and such.
>
>
> On Tue, Apr 7, 2015 at 9:14 AM, Yavar Husain <yavarhus...@gmail.com> wrote:
>
>> Well, I have indexed heterogeneous sources, including a variety of
>> NoSQL stores, RDBMSes, and rich documents (PDF, Word, etc.), using SolrJ.
>> The only prerequisite for using SolrJ is that you have an API to fetch
>> data from your data source (say JDBC for an RDBMS, Tika for extracting
>> text content from rich documents, etc.); then SolrJ is so damn great and
>> simple. It's as simple as downloading the jar, and a few lines of code
>> will send data to your Solr server after pre-processing it. More details
>> here:
>>
>> http://lucidworks.com/blog/indexing-with-solrj/
>>
>> https://wiki.apache.org/solr/Solrj
>>
>> http://www.solrtutorial.com/solrj-tutorial.html
>>
>> Cheers,
>> Yavar
>>
>>
>>
>> On Tue, Apr 7, 2015 at 4:18 PM, sangeetha.subraman...@gtnexus.com <
>> sangeetha.subraman...@gtnexus.com> wrote:
>>
>> > Hi,
>> >
>> > I am a newbie to Solr and come from a database background. We have a
>> > requirement to index files of different formats (X12, EDIFACT, CSV,
>> > XML). The input files can be of any of these formats, and we need to
>> > do a content-based search on them.
>> >
>> > From the web I understand we can use the Tika processor to extract the
>> > content and store it in Solr. What I want to know is: is there a better
>> > approach for indexing files in Solr? Can we index the documents by
>> > streaming them directly from the application? If so, what are the
>> > disadvantages of that (versus DIH, which fetches from the database)?
>> > Could someone share some insight on this? Are there any web links I can
>> > refer to to get some idea of it? Please do help.
>> >
>> > Thanks
>> > Sangeetha
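
A minimal sketch of the external SolrJ + Tika approach Erick and Yavar
describe, assuming Solr 5.x-era SolrJ; the URL, core name, and field names
are illustrative, not taken from the thread:

import java.io.File;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;
import org.apache.tika.Tika;

public class TikaSolrJIndexer {
    public static void main(String[] args) throws Exception {
        // Illustrative URL and core name; adjust to your installation.
        SolrClient client =
            new HttpSolrClient("http://localhost:8983/solr/collection1");
        Tika tika = new Tika();
        for (String path : args) {
            File file = new File(path);
            // Tika auto-detects the format (PDF, Word, HTML, ...) and
            // returns plain text, so the parsing load stays in this client
            // program rather than on the Solr node.
            String text = tika.parseToString(file);
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", file.getAbsolutePath()); // any unique key
            doc.addField("filename", file.getName());
            doc.addField("content", text); // assumes a "content" text field
            client.add(doc);
        }
        client.commit();
        client.close();
    }
}

Run it as "java TikaSolrJIndexer /path/to/doc1.pdf /path/to/doc2.docx" with
the SolrJ and Tika jars on the classpath. This is the point Erick makes
above: document parsing happens in the client, off the Solr node.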
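Dan's modulo partitioning isn't DIH-specific; the same idea can drive
several indexing threads in an external SolrJ program. A hedged sketch of
that variant follows (his actual setup generates several DIH XML configs
via bash/sed instead; the JDBC URL, table, and column names here are
hypothetical, and it partitions on a numeric id column rather than ROWNUM):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class PartitionedIndexer {
    static final int THREADS = 4; // Dan's four "threads"

    public static void main(String[] args) throws Exception {
        Thread[] workers = new Thread[THREADS];
        for (int i = 0; i < THREADS; i++) {
            final int threadId = i;
            workers[i] = new Thread(() -> indexPartition(threadId));
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
    }

    static void indexPartition(int threadId) {
        // Each worker claims the rows where MOD(id, THREADS) = threadId,
        // mirroring the MOD(ROWNUM, 4) partitioning in Dan's query.
        String sql = "SELECT id, url FROM docs WHERE MOD(id, " + THREADS
                + ") = " + threadId;
        try (SolrClient client =
                 new HttpSolrClient("http://localhost:8983/solr/collection1");
             Connection conn = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", rs.getString("id"));
                doc.addField("url", rs.getString("url"));
                client.add(doc);
            }
            client.commit();
        } catch (Exception e) {
            e.printStackTrace(); // crude; real code would handle/retry
        }
    }
}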