bitsondatadev commented on PR #1534:
URL: https://github.com/apache/iceberg-python/pull/1534#issuecomment-2599251493

   > @kevinjqliu - i'm doing this work on behalf of my company and when i ran 
my tests, i used a standard python virtual environment venv; i haven't figured 
out quite yet how to get poetry to work inside my companies firewall. So, not 
sure if those are errors I can address or if someone else can pitch in here.
   
   @mattmartin14, what's going on man!? Thanks for working on this and most 
impressively thanks for the comprehensive description.
   
   Out of curiosity, did you discuss your approach with anyone before putting 
this together? This is good but a few flags for OSS contributions to lower the 
upfront back and forth:
   
   1. My main concern is it looks like you're introducing `datafusion` to get 
this feature out and adding dependencies is generally avoided until absolutely 
necessary (i.e. the value of the new dependency serves many use cases or won't 
cause a lot of maintenance). If you haven't, I recommend informally discussing 
this with @Fokko or @kevinjqliu to see if datafusion is already on the roadmap 
or can be considered. Likely the conversation can finish with them, but they 
may suggest asking on the mailing list to document and discuss it with more 
folks. There may already be ways of doing this using the existing library and 
not adding another dependency.
   1. As you've noticed the build is breaking with the poetry issue and it 
sounds like this is happening on a work laptop. I recommend working with your 
company to allow you to work on OSS code outside of a company device as I'm 
familiar with how locked down their system is and it will be difficult to 
contribute behind the corporate VPN. I did this for contributions where I 
submitted the code to my repository on my OSS account (we weren't required to 
have company accounts so this worked out anyways). This would require a review 
to ensure sensitive information wasn't leaked. Once I submitted the code to my 
branch, I was authorized to work on that code on my personal laptop (generally 
off hours to get things polished for merging).
   1. For Apache projects, you have to add license notifications to each file 
you add and these files are missing:
   ```
    !????? 
/home/runner/work/iceberg-python/iceberg-python/pyiceberg/table/merge_rows_util.py
    !????? 
/home/runner/work/iceberg-python/iceberg-python/pyiceberg/table/test.py
    !????? 
/home/runner/work/iceberg-python/iceberg-python/tests/table/test_merge_rows.py
   ```
   1. There's bits of [dead 
code](https://github.com/apache/iceberg-python/pull/1534/files#diff-23e8153e0fd497a9212215bd2067068f3b56fa071770c7ef326db3d3d03cee9bR1123)
 scattered about, I'm not sure if that's for discussion or todo. I would either 
remove it or comment on it when this PR is ready for review.
   
   1. > Where I am hitting a wall on performance
      Aside from total disregard for it, performance should be pretty low on 
the list of concerns. This was also a new paradigm when I submitted to open 
source. You're not writing code for a corporate team-sized set of eyes, you're 
writing it for many people to evaluate. I always recommend doing the bare 
minimum unless the performant version adds ten or twenty lines of code. Focus 
on correctness and edge cases first. Second focus on minimalism and clarity. 
There's 500+ lines of code that people have to get their heads around. That can 
be justified, but often can be broken up to lower the burden on those reviewing 
the code and lowers the chances you'll sneak a bug in unintentionally.
      
   Contributing to OSS has a different focus than internal code, so hopefully 
these help. This does look well thought out in terms of implementation, but 
performance should be second or third try in favor of having code history that 
everyone in the community can wrap their heads around.
   
   I'd suggest to get these addressed before Fokko and Kevin scan it. I'll be 
happy to do a quick glance once the tests are running and there's some 
consensus around datafusion.
   
   PR #1 yeah!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org

Reply via email to