Re: Processing/Indexing CSV

Erick Erickson Fri, 10 Jun 2011 05:50:30 -0700

Well, here's a place to start if you want to patch the code:

http://wiki.apache.org/solr/HowToContribute

If you do want to take this on, hop on over to the dev list
and start a discussion. I'd start with some posts on that list
before entering or working on a JIRA issue; just ask for
some guidance. A good place to start is pretty much what
you've done here: state your problem and what you think
the correct behavior is.

Be prepared for things to be brought up you never thought
of <G>... which is the point of starting the discussion there.

A very good way to start is to get the code, compile it, and then
run some of the test cases in an IDE, stepping through the test
case in the debugger. Sometimes that doesn't work easily, but
if it does it gives you an idea of how the code works. There are
instructions at the above link for setting things up in an IDE
(Eclipse and IntelliJ are popular).

Just loading the project and looking for files that begin with
CSV might be a place to start. Then look for files that begin
with TestCSV. Both of these "look promising".

Anyway, if you get that far, then go over to the dev list and say
"I'm thinking of XXX, this code appears to be handled in YYY and
I'm thinking of changing it like ZZZ" and it will be well received.

Of course if you want to go ahead and make your changes and submit
a patch, that's even better, but it's often best to get a bit of guidance first.

Best
Erick

On Thu, Jun 9, 2011 at 5:17 PM, Helmut Hoffer von Ankershoffen
<helmut...@googlemail.com> wrote:
> On Thu, Jun 9, 2011 at 11:05 PM, Ken Krugler
> <kkrugler_li...@transpac.com> wrote:
>
>>
>> On Jun 9, 2011, at 1:27pm, Helmut Hoffer von Ankershoffen wrote:
>>
>> > Hi,
>> >
>> > ... that would be an option if there is a defined set of field names and a
>> > single column/CSV layout. The scenario, however, is different CSV files
>> > (from different shops) with individual column layouts (separators,
>> > encodings, etc.). The idea is to map known field names to defined field
>> > names in the Solr schema. If I understand the capabilities of the CSVLoader
>> > correctly (sorry, I am completely new to Solr, started work on it today)
>> > this is not possible - is it?
>>
>> As per the documentation on
>> http://wiki.apache.org/solr/UpdateCSV#fieldnames, you can specify the
>> names/positions of fields in the CSV file, and ignore the field names in
>> the file's header line.
>>
>> So this seems like it would solve your requirement, as each different
>> layout could specify its own such mapping during import.
>>
> Sure, but the requirement (to keep the process of integrating new shops
> efficient) is not to have one mapping per import (cf. the email regarding
> "more or less schema free") but to enhance one mapping that maps common
> field names to defined fields, disregarding the order of known fields/columns.
> As far as I understand, that is not a problem at all with DIH; however, DIH
> and CSV are not a perfect match ,-)
>
>
>> It could be handy to provide a fieldname map (versus the value map that
>> UpdateCSV supports).
>
> Definitely. Either a fieldname map in CSVLoader or a robust CSVLoader in DIH
> ...
>
>
>> Then you could use the header, and just provide a mapping from header
>> fieldnames to schema fieldnames.
>>
> That's the idea -)
>
> => What's the best way to progress? Either someone enhances the CSVLoader with
> a field mapper (with multiple input field names mapping to one field name in
> the Solr schema) or someone enhances the DIH with a robust CSV loader ,-).
> As I am completely new to this community, please give me the direction to go
> (or wait :-).
>
> best regards
>
>
>> -- Ken
>>
>> > On Thu, Jun 9, 2011 at 10:12 PM, Yonik Seeley
>> > <yo...@lucidimagination.com> wrote:
>> >
>> >> On Thu, Jun 9, 2011 at 4:07 PM, Helmut Hoffer von Ankershoffen
>> >> <helmut...@googlemail.com> wrote:
>> >>> Hi,
>> >>> yes, it's about CSV files loaded via HTTP from shops to be fed into a
>> >>> shopping search engine.
>> >>> The CSV Loader cannot map fields (only field values) etc.
>> >>
>> >> You can provide your own list of fieldnames and optionally ignore the
>> >> first line of the CSV file (assuming it contains the field names).
>> >> http://wiki.apache.org/solr/UpdateCSV#fieldnames
>> >>
>> >> -Yonik
>> >> http://www.lucidimagination.com
>> >>
>>
>> --------------------------
>> Ken Krugler
>> +1 530-210-6378
>> http://bixolabs.com
>> custom data mining solutions
>>
>
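
Until something like the fieldname map discussed above exists in the CSVLoader, the same effect might be approximated on the client side: read each shop file's header line, translate the known column names to schema field names, and pass the result as the `fieldnames` parameter. The sketch below is only an illustration under assumptions. It relies on the `fieldnames`, `header`, `skip` and `separator` parameters as described on the UpdateCSV wiki page (worth verifying against your Solr version), and the alias table, the function names and the `ignored_<n>` placeholder names are all hypothetical.

```python
import csv
import io
from urllib.parse import urlencode

# Hypothetical alias table: Solr schema field -> header names seen in shop files.
FIELD_ALIASES = {
    "id": ["id", "sku", "artikelnummer"],
    "name": ["name", "title", "produktname"],
    "price": ["price", "preis"],
}


def build_lookup(aliases):
    """Invert the alias table into {lowercased header name: schema field}."""
    return {
        alias.lower(): field
        for field, names in aliases.items()
        for alias in names
    }


def map_header(header_line, separator=",", aliases=FIELD_ALIASES):
    """Return one schema field name per CSV column, in column order.

    Columns with no known alias get a placeholder name (ignored_<index>)
    so that they can be listed in the `skip` parameter.
    """
    lookup = build_lookup(aliases)
    columns = next(csv.reader(io.StringIO(header_line), delimiter=separator))
    return [
        lookup.get(col.strip().lower(), f"ignored_{i}")
        for i, col in enumerate(columns)
    ]


def update_csv_params(header_line, separator=","):
    """Build the query string for one shop file's import request."""
    fields = map_header(header_line, separator)
    params = {
        "separator": separator,
        "header": "true",  # assumption: header line is skipped when fieldnames is given
        "fieldnames": ",".join(fields),
    }
    skipped = [f for f in fields if f.startswith("ignored_")]
    if skipped:
        params["skip"] = ",".join(skipped)
    return urlencode(params)
```

For example, `map_header("SKU;Produktname;Preis;Farbe", ";")` yields the schema names for the three known columns plus a placeholder for `Farbe`. The single alias table is the "one mapping" that gets extended as new shops are added; column order in each file no longer matters because the mapping is resolved per header line.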
