RE: [CSV] Strategies to handle duplicate headers

2023-06-20 Thread Seth Falco

I don't have a strong enough opinion to conclude what's best.

Giving it more thought, I think the interface approach I proposed is 
overcomplicated tbh. I can't imagine needing another duplicate header 
mode after this.


However, I could imagine situations where we define 
DuplicateHeaderMode.DEDUPLICATE, and a user isn't satisfied with our 
normalization strategy. For example, dots in the headers break 
ingesting the data in a third-party system. An interface could resolve 
this, but I guess in such a scenario, they can also just opt for another 
mode and normalize it themselves to bypass ours.
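
To make the "normalize it themselves" option concrete, here is a rough sketch of what deduplication with a caller-supplied naming rule could look like. Everything in it (the class HeaderDeduplicator, the deduplicate method, the renamer parameter) is hypothetical and not part of Commons CSV; the "name.N" rule only mirrors the Pandas behavior Bruno describes below.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.BiFunction;

/** Hypothetical helper for illustration; not part of Commons CSV. */
public final class HeaderDeduplicator {

    private HeaderDeduplicator() {
    }

    /**
     * Keeps the first occurrence of each header and renames later ones using
     * the given rule. Assumes the rule returns a different name for each
     * counter value; otherwise the loop below would not terminate.
     */
    public static List<String> deduplicate(final List<String> headers,
            final BiFunction<String, Integer, String> renamer) {
        final Set<String> used = new HashSet<>();
        final Map<String, Integer> counters = new HashMap<>();
        final List<String> result = new ArrayList<>(headers.size());
        for (final String header : headers) {
            String candidate = header;
            // Keep bumping the counter until the name is unused, so a literal
            // "A.1" column in the input cannot clash with a generated one.
            while (!used.add(candidate)) {
                final int n = counters.merge(header, 1, Integer::sum);
                candidate = renamer.apply(header, n);
            }
            result.add(candidate);
        }
        return result;
    }

    public static void main(final String[] args) {
        final List<String> headers = List.of("A", "A", "B", "A");
        // Dot-based naming, as in the Pandas example from this thread.
        System.out.println(deduplicate(headers, (name, n) -> name + "." + n)); // [A, A.1, B, A.2]
        // A user whose downstream system rejects dots could pick another rule.
        System.out.println(deduplicate(headers, (name, n) -> name + "_" + n)); // [A, A_1, B, A_2]
    }
}
```

A user unhappy with a built-in rule could run something like this on the header row before handing it to the parser, which is roughly the bypass described above.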


With that in mind, adding the enum value does make sense. I'd still be wary 
about making it default behavior anytime soon, unless there's evidence 
that deduplication is really what users expect.


Something to consider, though: we allow configuring the delimiter. I 
think parsing would be fine, but it might introduce edge cases for 
printing if the delimiter and normalization strategy overlap. For 
example, "A,A" becomes "A.1,A.2", but if the delimiter is ".", the 
output is effectively "A.1.A.2". We'll need test cases for that.
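
A small illustration of that overlap, using plain string handling rather than Commons CSV. The class and method names are made up for the example, and the join deliberately applies no quoting; whether the real printer would quote such values is exactly what the test cases would need to confirm.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical illustration; plain string handling, not Commons CSV. */
public final class DelimiterOverlapDemo {

    private DelimiterOverlapDemo() {
    }

    /** Joins values with the delimiter and applies no quoting at all. */
    public static String printNaively(final List<String> values, final String delimiter) {
        return String.join(delimiter, values);
    }

    /** Splits a line on the literal delimiter, ignoring any quoting. */
    public static List<String> parseNaively(final String line, final String delimiter) {
        return Arrays.asList(line.split(Pattern.quote(delimiter), -1));
    }

    public static void main(final String[] args) {
        // Deduplicated headers whose suffix separator equals the delimiter.
        final List<String> headers = List.of("A.1", "A.2");
        final String line = printNaively(headers, ".");
        System.out.println(line);                    // A.1.A.2
        System.out.println(parseNaively(line, ".")); // [A, 1, A, 2]
    }
}
```

Two headers go out and four fields come back, so a round-trip test with the delimiter set to "." is the kind of case worth covering.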


PS: Sorry if this message goes through twice. Looked to me that the 
email didn't go through the first time.


On 2023/06/20 21:28:16 Gary Gregory wrote:

> That's clever. So we could implement a new enum value
> DuplicateHeaderMode.DEDUPLICATE...
>
> Gary
>
> On Tue, Jun 20, 2023, 14:09 Bruno Kinoshita wrote:
>
> > Hi,
> >
> > > Bruno says:
> > > "With Pandas it automatically deduplicates the column names. Maybe
> > > that's a feature that we could have in Commons CSV too?"
> > >
> > > What does that mean and actually do? Say I have column A with row 1
> > > value of "X" and 2nd column A with row 1 value of 2. What do I get
> > > when I ask for column A row 1?
> > >
> >
> > When you ask for column A, you get the first column A with row 1 value of
> > "X". Then Pandas renames the other A column as "A.1". If you want to access
> > rows in the second A column, then you will use "A.1" as index.
> >
> > This is useful when you work with CSV's with many headers so that you still
> > have a valid name to use as index to access data, instead of having to rely
> > on the column index, for instance (or if you are using other libraries that
> > work with the column names, etc.)
> >
> > > As a first cut whatever we do could/should maintain the existing
> > > behavior. We can change the default later by popular demand.
> > >
> >
> > +1
> >
> > Cheers
> >
> > Bruno
> >
> > On Tue, 20 Jun 2023 at 13:39, Gary Gregory wrote:
> >
> > > Hi All,
> > >
> > > This thread is a follow-up to
> > > https://github.com/apache/commons-csv/pull/309#issuecomment-1441456258
> > >
> > > Bruno says:
> > > "With Pandas it automatically deduplicates the column names. Maybe
> > > that's a feature that we could have in Commons CSV too?"
> > >
> > > What does that mean and actually do? Say I have column A with row 1
> > > value of "X" and 2nd column A with row 1 value of 2. What do I get
> > > when I ask for column A row 1?
> > >
> > > Seth says:
> > > "HeaderStrategy Interface
> > > Contains two functions:
> > >
> > > #normalizeHeaders(headings) - With given heading, output a list that
> > > fits with whatever the strategy is going for.
> > > #get(record, header) - Fetch value(s) based on given column name."
> > >
> > > I would see perhaps two interfaces so that lambdas might be used more
> > > simply. Maybe, needs an example.
> > >
> > > "I'm also wary that this may screw up existing projects that depend on
> > > allowing/disallowing duplicates. i.e. want to allow duplicates and
> > > handle things through indexes / iteration, so this didn't cause a
> > > problem for them and want to preserve header names, and so don't need
> > > the headers deduplicated."
> > >
> > > As a first cut whatever we do could/should maintain the existing
> > > behavior. We can change the default later by popular demand.
> > >
> > > Gary
> > >
--
GitHub: https://github.com/SethFalco
Fediverse: @se...@fosstodon.org
LinkedIn: https://www.linkedin.com/in/sethfalco/



