Hi,

To make my point clearer: if the CSV has a fixed schema / column layout,
using the RegexTransformer is of course a possibility (however awkward). But
if you want to implement a (more or less) schema-free shopping search engine
...
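
To illustrate what header-driven, schema-free reading could look like, here is
a minimal stand-alone sketch in Python. It is not DataImportHandler code and
not the proposed CSVEntityProcessor; the function name read_csv_rows and the
sample feed are made up for illustration. It only shows that a real CSV parser
copes with partially quoted columns and a configurable separator, and that
columns can be addressed by the names found in the leading row.

```python
import csv
import io


def read_csv_rows(text, delimiter=";"):
    """Yield one dict per data row, keyed by the names in the header row.

    Hypothetical helper for illustration only; it is not part of Solr.
    """
    # DictReader takes the field names from the first row, so no column
    # layout has to be configured in advance.
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter, quotechar='"')
    for row in reader:
        yield dict(row)


# Made-up sample feed: only one column is double-quoted, and the quoted
# value contains the separator itself.
sample = 'sku;title;price\n42;"Lamp; brass";19.90\n43;Chair;49.00\n'

rows = list(read_csv_rows(sample))
for row in rows:
    print(row)
```

A split on the separator alone would cut the quoted title into two columns;
the parser keeps it as one value, and every column arrives under its heading,
whatever the layout of the file happens to be.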

regards

On Thu, Jun 9, 2011 at 9:31 PM, Helmut Hoffer von Ankershoffen <
helmut...@googlemail.com> wrote:

> Hi,
>
> there seems to be no way to index CSV using the DataImportHandler.
>
> Using a combination of LineEntityProcessor
> <http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor> and
> RegexTransformer
> <http://wiki.apache.org/solr/DataImportHandler#RegexTransformer> as
> proposed in
> http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/
> does not work for real-world CSV files.
>
> E.g. many CSV files have double-quotes enclosing some but not all columns -
> there is no elegant way to segment this using a simple regular expression.
>
> As CSV is still very common esp. in E-Commerce scenarios, I propose that
> Solr provides a CSVEntityProcessor that:
> 1) Handles CSV files in which all, none, or only some of the columns are
> enclosed in double quotes
> 2) Allows for a configurable column separator (';', ',', '\t' etc.)
> 3) Allows for a leading row containing column headings
> 4) If there is a leading row with column headings, provides a way to
> address columns by their column names and map them to Solr fields (similar
> to the XPathEntityProcessor)
> 5) Auto-detects encoding of the file (UTF-8 etc.)
>
> This would make it A LOT easier to use Solr for E-Commerce scenarios.
>
> If there is no such entity processor in the works, I will develop one ... So
> please let me know.
>
> Regards
>
