bitsondatadev commented on PR #1534:
URL: https://github.com/apache/iceberg-python/pull/1534#issuecomment-2599251493
> @kevinjqliu - i'm doing this work on behalf of my company and when i ran
my tests, i used a standard python virtual environment venv; i haven't figured
out quite yet how to get poetry to work inside my companies firewall. So, not
sure if those are errors I can address or if someone else can pitch in here.
@mattmartin14, what's going on man!? Thanks for working on this and most
impressively thanks for the comprehensive description.
Out of curiosity, did you discuss your approach with anyone before putting
this together? This is good but a few flags for OSS contributions to lower the
upfront back and forth:
1. My main concern is it looks like you're introducing `datafusion` to get
this feature out and adding dependencies is generally avoided until absolutely
necessary (i.e. the value of the new dependency serves many use cases or won't
cause a lot of maintenance). If you haven't, I recommend informally discussing
this with @Fokko or @kevinjqliu to see if datafusion is already on the roadmap
or can be considered. Likely the conversation can finish with them, but they
may suggest asking on the mailing list to document and discuss it with more
folks. There may already be ways of doing this using the existing library and
not adding another dependency.
1. As you've noticed the build is breaking with the poetry issue and it
sounds like this is happening on a work laptop. I recommend working with your
company to allow you to work on OSS code outside of a company device as I'm
familiar with how locked down their system is and it will be difficult to
contribute behind the corporate VPN. I did this for contributions where I
submitted the code to my repository on my OSS account (we weren't required to
have company accounts so this worked out anyways). This would require a review
to ensure sensitive information wasn't leaked. Once I submitted the code to my
branch, I was authorized to work on that code on my personal laptop (generally
off hours to get things polished for merging).
1. For Apache projects, you have to add license notifications to each file
you add and these files are missing:
```
!?????
/home/runner/work/iceberg-python/iceberg-python/pyiceberg/table/merge_rows_util.py
!?????
/home/runner/work/iceberg-python/iceberg-python/pyiceberg/table/test.py
!?????
/home/runner/work/iceberg-python/iceberg-python/tests/table/test_merge_rows.py
```
1. There's bits of [dead
code](https://github.com/apache/iceberg-python/pull/1534/files#diff-23e8153e0fd497a9212215bd2067068f3b56fa071770c7ef326db3d3d03cee9bR1123)
scattered about, I'm not sure if that's for discussion or todo. I would either
remove it or comment on it when this PR is ready for review.
1. > Where I am hitting a wall on performance
Aside from total disregard for it, performance should be pretty low on
the list of concerns. This was also a new paradigm when I submitted to open
source. You're not writing code for a corporate team-sized set of eyes, you're
writing it for many people to evaluate. I always recommend doing the bare
minimum unless the performant version adds ten or twenty lines of code. Focus
on correctness and edge cases first. Second focus on minimalism and clarity.
There's 500+ lines of code that people have to get their heads around. That can
be justified, but often can be broken up to lower the burden on those reviewing
the code and lowers the chances you'll sneak a bug in unintentionally.
Contributing to OSS has a different focus than internal code, so hopefully
these help. This does look well thought out in terms of implementation, but
performance should be second or third try in favor of having code history that
everyone in the community can wrap their heads around.
I'd suggest to get these addressed before Fokko and Kevin scan it. I'll be
happy to do a quick glance once the tests are running and there's some
consensus around datafusion.
PR #1 yeah!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]