Hi, to make my point clearer: if the CSV has a fixed schema / column layout, using the RegexTransformer is of course a possibility (if an awkward one). But if you want to implement a (more or less) schema-free shopping search engine ...
regards

On Thu, Jun 9, 2011 at 9:31 PM, Helmut Hoffer von Ankershoffen <helmut...@googlemail.com> wrote:

> Hi,
>
> there seems to be no way to index CSV using the DataImportHandler.
>
> Using a combination of
> LineEntityProcessor<http://wiki.apache.org/solr/DataImportHandler#LineEntityProcessor>
> and
> RegexTransformer<http://wiki.apache.org/solr/DataImportHandler#RegexTransformer>
> as proposed in
> http://robotlibrarian.billdueber.com/an-exercise-in-solr-and-dataimporthandler-hathitrust-data/
> is not working for real world CSV files.
>
> E.g. many CSV files have double-quotes enclosing some but not all columns -
> there is no elegant way to segment this using a simple regular expression.
>
> As CSV is still very common esp. in E-Commerce scenarios, I propose that
> Solr provides a CSVEntityProcessor that:
> 1) Handles the case of CSV files with/without and with some double-quote
> enclosed columns
> 2) Allows for a configurable column separator (';',',','\t' etc.)
> 3) Allows for a leading row containing column headings
> 4) If there is a leading row with column headings provides a possibility to
> address columns by their column names and map them to Solr fields (similar
> to the XPathEntityProcessor)
> 5) Auto-detects encoding of the file (UTF-8 etc.)
>
> This would make it A LOT easier to use Solr for E-Commerce scenarios.
>
> If there is no such entity processor in the works i will develop one ... So
> please let me know.
>
> Regards
>
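
To illustrate points 1) to 4) of the proposal, here is a minimal sketch in Python (not Solr code, and not the proposed CSVEntityProcessor). It uses the standard csv module on a made-up sample with a header row, ';' as separator and only one column quoted. The sample data and the helper name read_rows are assumptions for illustration only. Encoding auto-detection (point 5) is not covered.

```python
import csv
import io

# Hypothetical sample: a header row, ';' as separator, and only some columns quoted.
SAMPLE = 'sku;name;price\n1001;"Kettle; 1.7 l, steel";24.90\n1002;Toaster;19.50\n'


def read_rows(text, delimiter=";"):
    """Parse CSV text and return one dict per data row, keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter, quotechar='"')
    return list(reader)


rows = read_rows(SAMPLE)

# A plain split on the separator cuts the quoted column into two pieces.
naive = SAMPLE.splitlines()[1].split(";")

print(len(naive))        # 4 pieces instead of 3 columns
print(rows[0]["name"])   # Kettle; 1.7 l, steel
```

The plain split is roughly what a simple separator-based regular expression would do, which is why the quoted column breaks. Addressing columns by their header names, as in rows[0]["name"], is the kind of mapping to Solr fields that point 4) asks for.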