bitsondatadev commented on PR #1534: URL: https://github.com/apache/iceberg-python/pull/1534#issuecomment-2599251493
> @kevinjqliu - i'm doing this work on behalf of my company and when i ran my tests, i used a standard python virtual environment venv; i haven't figured out quite yet how to get poetry to work inside my companies firewall. So, not sure if those are errors I can address or if someone else can pitch in here. @mattmartin14, what's going on man!? Thanks for working on this and most impressively thanks for the comprehensive description. Out of curiosity, did you discuss your approach with anyone before putting this together? This is good but a few flags for OSS contributions to lower the upfront back and forth: 1. My main concern is it looks like you're introducing `datafusion` to get this feature out and adding dependencies is generally avoided until absolutely necessary (i.e. the value of the new dependency serves many use cases or won't cause a lot of maintenance). If you haven't, I recommend informally discussing this with @Fokko or @kevinjqliu to see if datafusion is already on the roadmap or can be considered. Likely the conversation can finish with them, but they may suggest asking on the mailing list to document and discuss it with more folks. There may already be ways of doing this using the existing library and not adding another dependency. 1. As you've noticed the build is breaking with the poetry issue and it sounds like this is happening on a work laptop. I recommend working with your company to allow you to work on OSS code outside of a company device as I'm familiar with how locked down their system is and it will be difficult to contribute behind the corporate VPN. I did this for contributions where I submitted the code to my repository on my OSS account (we weren't required to have company accounts so this worked out anyways). This would require a review to ensure sensitive information wasn't leaked. Once I submitted the code to my branch, I was authorized to work on that code on my personal laptop (generally off hours to get things polished for merging). 1. For Apache projects, you have to add license notifications to each file you add and these files are missing: ``` !????? /home/runner/work/iceberg-python/iceberg-python/pyiceberg/table/merge_rows_util.py !????? /home/runner/work/iceberg-python/iceberg-python/pyiceberg/table/test.py !????? /home/runner/work/iceberg-python/iceberg-python/tests/table/test_merge_rows.py ``` 1. There's bits of [dead code](https://github.com/apache/iceberg-python/pull/1534/files#diff-23e8153e0fd497a9212215bd2067068f3b56fa071770c7ef326db3d3d03cee9bR1123) scattered about, I'm not sure if that's for discussion or todo. I would either remove it or comment on it when this PR is ready for review. 1. > Where I am hitting a wall on performance Aside from total disregard for it, performance should be pretty low on the list of concerns. This was also a new paradigm when I submitted to open source. You're not writing code for a corporate team-sized set of eyes, you're writing it for many people to evaluate. I always recommend doing the bare minimum unless the performant version adds ten or twenty lines of code. Focus on correctness and edge cases first. Second focus on minimalism and clarity. There's 500+ lines of code that people have to get their heads around. That can be justified, but often can be broken up to lower the burden on those reviewing the code and lowers the chances you'll sneak a bug in unintentionally. Contributing to OSS has a different focus than internal code, so hopefully these help. This does look well thought out in terms of implementation, but performance should be second or third try in favor of having code history that everyone in the community can wrap their heads around. I'd suggest to get these addressed before Fokko and Kevin scan it. I'll be happy to do a quick glance once the tests are running and there's some consensus around datafusion. PR #1 yeah! -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org