Re: Processing/Indexing CSV

Ken Krugler Thu, 09 Jun 2011 14:57:36 -0700

On Jun 9, 2011, at 2:21pm, Helmut Hoffer von Ankershoffen wrote:

> Hi,
> 
> btw: there seems to be somewhat of a mismatch between the efforts to enhance
> DIH regarding the CSV format (James Dyer) and the effort to maintain the
> CSVLoader (Ken Krugler). How about merging your efforts and migrating the
> CSVLoader to a CSVEntityProcessor (cf. my initial email)? :-)


While I'm a CSVLoader user (and I've found/fixed one bug in it), I'm not 
involved in any active development/maintenance of that piece of code.

If James or you can make progress on merging support for CSV into DIH, that's 
great.
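
As a rough illustration of the header-to-schema fieldname map discussed
further down in this thread, here is a minimal client-side sketch. It assumes
the UpdateCSV `fieldnames`, `header`, `separator` and `skip` parameters behave
as described on the wiki page linked below. The alias table, the function
names and the sample header are hypothetical, and column names that themselves
contain commas are not handled.

```python
import csv
import io
from urllib.parse import urlencode

# Hypothetical alias table: header names seen in shop feeds -> schema fields.
FIELD_ALIASES = {
    "id": "id",
    "sku": "id",
    "artikelnummer": "id",
    "name": "name",
    "title": "name",
    "price": "price",
    "preis": "price",
}


def map_header(header, aliases=FIELD_ALIASES):
    """Return (fieldnames, skipped) for one CSV header row."""
    fieldnames, skipped = [], []
    for column in header:
        name = column.strip()
        target = aliases.get(name.lower())
        if target is None:
            # Unknown column: keep its own name and list it under `skip`.
            target = name
            skipped.append(name)
        fieldnames.append(target)
    return fieldnames, skipped


def update_csv_params(first_line, separator=","):
    """Build UpdateCSV request parameters from a feed's header line."""
    header = next(csv.reader(io.StringIO(first_line), delimiter=separator))
    fieldnames, skipped = map_header(header)
    params = {
        "separator": separator,
        "header": "true",  # first line holds names; `fieldnames` replaces them
        "fieldnames": ",".join(fieldnames),
    }
    if skipped:
        params["skip"] = ",".join(skipped)
    return params


if __name__ == "__main__":
    # Query string to append to the update handler URL for this feed.
    print(urlencode(update_csv_params("SKU;Title;Preis;Colour", separator=";")))
```

The point of doing it this way is that one alias table serves every shop, in
any column order; only the table grows when a new shop uses a new header name.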

-- Ken


> On Thu, Jun 9, 2011 at 11:17 PM, Helmut Hoffer von Ankershoffen <
> [email protected]> wrote:
> 
>> 
>> 
>> On Thu, Jun 9, 2011 at 11:05 PM, Ken Krugler 
>> <[email protected]> wrote:
>> 
>>> 
>>> On Jun 9, 2011, at 1:27pm, Helmut Hoffer von Ankershoffen wrote:
>>> 
>>>> Hi,
>>>> 
>>>> ... that would be an option if there were a defined set of field names and
>>>> a single column/CSV layout. The scenario, however, is different CSV files
>>>> (from different shops) with individual column layouts (separators,
>>>> encodings, etc.). The idea is to map known field names to defined field
>>>> names in the Solr schema. If I understand the capabilities of the
>>>> CSVLoader correctly (sorry, I am completely new to Solr, started work on
>>>> it today), this is not possible - is it?
>>> 
>>> As per the documentation at
>>> http://wiki.apache.org/solr/UpdateCSV#fieldnames, you can specify the
>>> names/positions of fields in the CSV file, and ignore the field names in
>>> the file's header line.
>>> 
>>> So this seems like it would solve your requirement, as each different
>>> layout could specify its own such mapping during import.
>>> 
>> Sure, but the requirement (to keep the process of integrating new shops
>> efficient) is not to have one mapping per import (cf. the email regarding
>> "more or less schema free") but to extend a single mapping that maps common
>> field names to defined fields, regardless of the order of the known
>> fields/columns. As far as I understand, that is not a problem at all with
>> DIH; however, DIH and CSV are not a perfect match ,-)
>> 
>> 
>>> It could be handy to provide a fieldname map (versus the value map that
>>> UpdateCSV supports).
>> 
>> Definitely. Either a fieldname map in CSVLoader or a robust CSVLoader in
>> DIH ...
>> 
>> 
>>> Then you could use the header, and just provide a mapping from header
>>> fieldnames to schema fieldnames.
>>> 
>> That's the idea :-)
>> 
>> => What's the best way to progress? Either someone enhances the CSVLoader
>> with a field mapper (with multiple input field names mapping to one field
>> name in the Solr schema) or someone enhances the DIH with a robust CSV
>> loader ,-). As I am completely new to this community, please give me the
>> direction to go (or wait :-).
>> 
>> best regards
>> 
>> 
>>> -- Ken
>>> 
>>>> On Thu, Jun 9, 2011 at 10:12 PM, Yonik Seeley <
>>> [email protected]> wrote:
>>>> 
>>>>> On Thu, Jun 9, 2011 at 4:07 PM, Helmut Hoffer von Ankershoffen
>>>>> <[email protected]> wrote:
>>>>>> Hi,
>>>>>> yes, it's about CSV files loaded via HTTP from shops to be fed into a
>>>>>> shopping search engine.
>>>>>> The CSV Loader cannot map fields (only field values) etc.
>>>>> 
>>>>> You can provide your own list of fieldnames and optionally ignore the
>>>>> first line of the CSV file (assuming it contains the field names).
>>>>> http://wiki.apache.org/solr/UpdateCSV#fieldnames
>>>>> 
>>>>> -Yonik
>>>>> http://www.lucidimagination.com
>>>>> 
>>> 
>>> --------------------------
>>> Ken Krugler
>>> +1 530-210-6378
>>> http://bixolabs.com
>>> custom data mining solutions
>>> 
>> 

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions
