[GitHub] [iceberg] nastra commented on issue #6415: Vectorized Read Issue

2022-12-14 Thread GitBox
nastra commented on issue #6415: URL: https://github.com/apache/iceberg/issues/6415#issuecomment-1350603108 See https://github.com/apache/iceberg/blob/c07f2aabc0a1d02f068ecf1514d2479c0fbdd3b0/arrow/src/test/java/org/apache/iceberg/arrow/vectorized/ArrowReaderTest.java#L149-L210 for some bac

[GitHub] [iceberg] nastra commented on issue #6415: Vectorized Read Issue

2022-12-14 Thread GitBox
nastra commented on issue #6415: URL: https://github.com/apache/iceberg/issues/6415#issuecomment-1350604548 @rdblue thoughts on getting the above issue fixed? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL abo

[GitHub] [iceberg] nastra opened a new issue, #6423: Run JMH Benchmarks weekly / Visualize benchmark results

2022-12-14 Thread GitBox
nastra opened a new issue, #6423: URL: https://github.com/apache/iceberg/issues/6423 ### Feature Request / Improvement Currently we have a way to run JMH benchmarks on forks. The goal here is that JMH Benchmarks are executed on a weekly (or any other cadence) via a GitHub action.

[GitHub] [iceberg] ahshahid opened a new issue, #6424: The size estimation formula for spark task is incorrect

2022-12-14 Thread GitBox
ahshahid opened a new issue, #6424: URL: https://github.com/apache/iceberg/issues/6424 ### Apache Iceberg version main (development) ### Query engine Spark ### Please describe the bug 🐞 The size estimation formula used for non partition cols as seen in C

[GitHub] [iceberg] pvary commented on a diff in pull request #3337: Fixed issue #3336: Best efforts to release hive table lock

2022-12-14 Thread GitBox
pvary commented on code in PR #3337: URL: https://github.com/apache/iceberg/pull/3337#discussion_r1048287741 ## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java: ## @@ -499,11 +499,21 @@ private void unlock(Optional lockId) { } @VisibleForTes

[GitHub] [iceberg] nazq commented on issue #6415: Vectorized Read Issue

2022-12-14 Thread GitBox
nazq commented on issue #6415: URL: https://github.com/apache/iceberg/issues/6415#issuecomment-1351547529 Happy to create a PR @rdblue , just want to make sure we're on the right track here

[GitHub] [iceberg] rubenvdg commented on issue #6361: Python: Ignore home folder when running tests

2022-12-14 Thread GitBox
rubenvdg commented on issue #6361: URL: https://github.com/apache/iceberg/issues/6361#issuecomment-1351585780 Happy to take this one on, if nobody is working on it atm.

[GitHub] [iceberg] RussellSpitzer commented on issue #6424: The size estimation formula for spark task is incorrect

2022-12-14 Thread GitBox
RussellSpitzer commented on issue #6424: URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1351621767 The current code is slightly different than this, https://github.com/apache/iceberg/blob/33217abf7f88c6c22a8c43b320f9de48de998b94/api/src/main/java/org/apache/iceberg/C

[GitHub] [iceberg] nastra commented on issue #6415: Vectorized Read Issue

2022-12-14 Thread GitBox
nastra commented on issue #6415: URL: https://github.com/apache/iceberg/issues/6415#issuecomment-1351641875 @nazq just FYI, there's already #3024 that addresses this issue.

[GitHub] [iceberg] nazq commented on issue #6415: Vectorized Read Issue

2022-12-14 Thread GitBox
nazq commented on issue #6415: URL: https://github.com/apache/iceberg/issues/6415#issuecomment-1351656081 Excellent. Thanks for the pointer @nastra

[GitHub] [iceberg] nastra commented on issue #6420: Iceberg Materialized View Spec

2022-12-14 Thread GitBox
nastra commented on issue #6420: URL: https://github.com/apache/iceberg/issues/6420#issuecomment-1351656857 @JanKaul I think it would be great to get this out to the DEV mailing list to get more attention and input from people

[GitHub] [iceberg] ahshahid commented on issue #6424: The size estimation formula for spark task is incorrect

2022-12-14 Thread GitBox
ahshahid commented on issue #6424: URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1351671855 @RussellSpitzer Right, I missed the modification of " - splitOffset". Though the bug, which I think is in the formula, still remains. My reasoning is as follows: the fu

[GitHub] [iceberg] stevenzwu merged pull request #6313: Flink: use correct metric config for position deletes

2022-12-14 Thread GitBox
stevenzwu merged PR #6313: URL: https://github.com/apache/iceberg/pull/6313

[GitHub] [iceberg] stevenzwu commented on pull request #6313: Flink: use correct metric config for position deletes

2022-12-14 Thread GitBox
stevenzwu commented on PR #6313: URL: https://github.com/apache/iceberg/pull/6313#issuecomment-1351737859 thanks @chenjunjiedada for the contribution

[GitHub] [iceberg] RussellSpitzer commented on issue #6424: The size estimation formula for spark task is incorrect

2022-12-14 Thread GitBox
RussellSpitzer commented on issue #6424: URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1351768402 > now if total row count of a split/file = (scannedFileFraction * file().recordCount()) This is I think the confusion, we are attempting to determine how many rows are

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6378: Spark: Extend Timeout During Partial Progress Rewrites

2022-12-14 Thread GitBox
RussellSpitzer commented on code in PR #6378: URL: https://github.com/apache/iceberg/pull/6378#discussion_r1048728860 ## core/src/main/java/org/apache/iceberg/actions/RewriteDataFilesCommitManager.java: ## @@ -225,25 +225,40 @@ public void close() { LOG.info("Closing comm

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6350: Query changelog table with a timestamp range

2022-12-14 Thread GitBox
RussellSpitzer commented on code in PR #6350: URL: https://github.com/apache/iceberg/pull/6350#discussion_r1048731625 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java: ## @@ -308,6 +339,17 @@ public Scan buildChangelogScan() { return n

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6350: Query changelog table with a timestamp range

2022-12-14 Thread GitBox
RussellSpitzer commented on code in PR #6350: URL: https://github.com/apache/iceberg/pull/6350#discussion_r1048750874 ## spark/v3.3/spark-extensions/src/test/java/org/apache/iceberg/spark/extensions/TestChangelogTable.java: ## @@ -137,6 +138,64 @@ public void testOverwrites() {

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6344: Spark 3.3: Introduce the changelog iterator

2022-12-14 Thread GitBox
RussellSpitzer commented on code in PR #6344: URL: https://github.com/apache/iceberg/pull/6344#discussion_r104878 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/ChangelogIterator.java: ## @@ -0,0 +1,162 @@ +/* + * Licensed to the Apache Software Foundation (ASF)

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6344: Spark 3.3: Introduce the changelog iterator

2022-12-14 Thread GitBox
RussellSpitzer commented on code in PR #6344: URL: https://github.com/apache/iceberg/pull/6344#discussion_r1048780674 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/ChangelogIterator.java: ## @@ -0,0 +1,162 @@ +/* + * Licensed to the Apache Software Foundation (ASF)

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6344: Spark 3.3: Introduce the changelog iterator

2022-12-14 Thread GitBox
RussellSpitzer commented on code in PR #6344: URL: https://github.com/apache/iceberg/pull/6344#discussion_r1048789305 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/ChangelogIterator.java: ## @@ -0,0 +1,162 @@ +/* + * Licensed to the Apache Software Foundation (ASF)

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6344: Spark 3.3: Introduce the changelog iterator

2022-12-14 Thread GitBox
RussellSpitzer commented on code in PR #6344: URL: https://github.com/apache/iceberg/pull/6344#discussion_r1048790825 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/ChangelogIterator.java: ## @@ -0,0 +1,162 @@ +/* + * Licensed to the Apache Software Foundation (ASF)

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6344: Spark 3.3: Introduce the changelog iterator

2022-12-14 Thread GitBox
RussellSpitzer commented on code in PR #6344: URL: https://github.com/apache/iceberg/pull/6344#discussion_r1048792877 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/ChangelogIterator.java: ## @@ -0,0 +1,162 @@ +/* + * Licensed to the Apache Software Foundation (ASF)

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6344: Spark 3.3: Introduce the changelog iterator

2022-12-14 Thread GitBox
RussellSpitzer commented on code in PR #6344: URL: https://github.com/apache/iceberg/pull/6344#discussion_r1048795379 ## spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/ChangelogIterator.java: ## @@ -0,0 +1,133 @@ +/* + * Licensed to the Apache Software Foundation (ASF)

[GitHub] [iceberg] ahshahid commented on issue #6424: The size estimation formula for spark task is incorrect

2022-12-14 Thread GitBox
ahshahid commented on issue #6424: URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1351885991 Right, that I agree with. Maybe along with the split offset (which is the start of the split), we need the end of the split. But still, please allow me to describe this simplified case, wh

[GitHub] [iceberg] rdblue commented on a diff in pull request #6405: API: Add Aggregate expression evaluation

2022-12-14 Thread GitBox
rdblue commented on code in PR #6405: URL: https://github.com/apache/iceberg/pull/6405#discussion_r1048806645 ## api/src/main/java/org/apache/iceberg/expressions/CountStar.java: ## @@ -0,0 +1,44 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more co

[GitHub] [iceberg] RussellSpitzer commented on issue #6424: The size estimation formula for spark task is incorrect

2022-12-14 Thread GitBox
RussellSpitzer commented on issue #6424: URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1351901052 > here length() is the amount of bytes scanned ( only partially read) https://github.com/apache/iceberg/blob/33217abf7f88c6c22a8c43b320f9de48de998b94/api/src/main/java/o

[GitHub] [iceberg] RussellSpitzer commented on a diff in pull request #6344: Spark 3.3: Introduce the changelog iterator

2022-12-14 Thread GitBox
RussellSpitzer commented on code in PR #6344: URL: https://github.com/apache/iceberg/pull/6344#discussion_r1048813554 ## spark/v3.3/spark/src/test/java/org/apache/iceberg/spark/TestChangelogIterator.java: ## @@ -0,0 +1,96 @@ +/* + * Licensed to the Apache Software Foundation (AS

[GitHub] [iceberg] flyrain commented on pull request #6350: Query changelog table with a timestamp range

2022-12-14 Thread GitBox
flyrain commented on PR #6350: URL: https://github.com/apache/iceberg/pull/6350#issuecomment-1351909259 Thanks @RussellSpitzer. Hi @szehon-ho @hililiwei, please let me know if you have any comments. Thanks!

[GitHub] [iceberg] dmgcodevil commented on a diff in pull request #3337: Fixed issue #3336: Best efforts to release hive table lock

2022-12-14 Thread GitBox
dmgcodevil commented on code in PR #3337: URL: https://github.com/apache/iceberg/pull/3337#discussion_r1048815939 ## hive-metastore/src/main/java/org/apache/iceberg/hive/HiveTableOperations.java: ## @@ -499,11 +499,21 @@ private void unlock(Optional lockId) { } @VisibleF

[GitHub] [iceberg] ahshahid commented on issue #6424: The size estimation formula for spark task is incorrect

2022-12-14 Thread GitBox
ahshahid commented on issue #6424: URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1351924900 Right, I was also thinking that this is where I have a misunderstanding or bug. The question is whether the recordCount represents the scanned fraction row count, or the to

[GitHub] [iceberg] RussellSpitzer commented on issue #6424: The size estimation formula for spark task is incorrect

2022-12-14 Thread GitBox
RussellSpitzer commented on issue #6424: URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1351938390 Record count does not represent the scanned fraction. I linked you to the code, it's a representation of a row in a manifestFile which is the metadata for the entire file.

[GitHub] [iceberg] ahshahid commented on issue #6424: The size estimation formula for spark task is incorrect

2022-12-14 Thread GitBox
ahshahid commented on issue #6424: URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1351939955 I see. let me see if I can explain what I mean by test...

[GitHub] [iceberg] asheeshgarg commented on issue #6415: Vectorized Read Issue

2022-12-14 Thread GitBox
asheeshgarg commented on issue #6415: URL: https://github.com/apache/iceberg/issues/6415#issuecomment-1351998029 @nastra thanks for the references, that seems to be the case. @rdblue this really seems to be the case with a lot of datasets. Do we have any timeline when https://github.com/apache/ice

[GitHub] [iceberg] ahshahid commented on issue #6424: The size estimation formula for spark task is incorrect

2022-12-14 Thread GitBox
ahshahid commented on issue #6424: URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1352025232 @RussellSpitzer : I see what you are saying about record count corresponding to total file size. Let me look into what is causing something wrong in my test for join

[GitHub] [iceberg] ahshahid commented on issue #6424: The size estimation formula for spark task is incorrect

2022-12-14 Thread GitBox
ahshahid commented on issue #6424: URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1352040671 @RussellSpitzer: apologies for bugging you, I was hoping for one more clarification on this aspect: long splitOffset = (file().splitOffsets() != null) ? file().splitOffsets().get(0)

[GitHub] [iceberg] RussellSpitzer commented on issue #6424: The size estimation formula for spark task is incorrect

2022-12-14 Thread GitBox
RussellSpitzer commented on issue #6424: URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1352058762 Parquet files have non-data metadata which is not scanned when we read the split. So if for example our first row-group starts at byte 1000, we don't want to count 1000 bytes
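
The adjustment discussed in this thread can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not a copy of Iceberg's code: the class `SplitRowEstimate`, its method names, and the file and split sizes are all hypothetical, chosen to extend the 1000-byte example from the comment above.

```java
// Hypothetical sketch of the row estimate discussed in issue #6424.
// None of these names come from Iceberg; they exist only for illustration.
class SplitRowEstimate {

    // Fraction of the file's data region covered by a split.
    // firstSplitOffset is where the first row group starts; the bytes before it
    // are not scanned, so they are excluded from the denominator.
    static double scannedFraction(long splitLength, long fileSizeInBytes, long firstSplitOffset) {
        return (double) splitLength / (fileSizeInBytes - firstSplitOffset);
    }

    // Estimated rows in the split: the scanned fraction times the record count
    // of the whole file (the record count describes the entire file, not the split).
    static long estimatedRows(long splitLength, long fileSizeInBytes,
                              long firstSplitOffset, long fileRecordCount) {
        double fraction = scannedFraction(splitLength, fileSizeInBytes, firstSplitOffset);
        return (long) (fraction * fileRecordCount);
    }

    public static void main(String[] args) {
        // Assumed numbers: an 11000-byte file whose first row group starts at byte 1000,
        // holding 2000 records, read through a 5000-byte split.
        System.out.println(scannedFraction(5000, 11000, 1000));     // prints 0.5
        System.out.println(estimatedRows(5000, 11000, 1000, 2000)); // prints 1000
    }
}
```

Without subtracting the offset, the same split would be counted as 5000/11000 of the file, which understates the share of data rows it covers.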

[GitHub] [iceberg] flyrain commented on pull request #6350: Query changelog table with a timestamp range

2022-12-14 Thread GitBox
flyrain commented on PR #6350: URL: https://github.com/apache/iceberg/pull/6350#issuecomment-1352094105 Thanks @szehon-ho for the review. I believe you are talking about the case 2 in https://github.com/apache/iceberg/pull/6350#discussion_r1044906141. I did try to return an empty set, but i

[GitHub] [iceberg] ahshahid commented on issue #6424: The size estimation formula for spark task is incorrect

2022-12-14 Thread GitBox
ahshahid commented on issue #6424: URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1352159451 Thank you @RussellSpitzer. I will close this. Maybe the issue I'm seeing is the conversion of double to long for a fractional value. Will update once I debug more. Sorry for the false alarm

[GitHub] [iceberg] ahshahid closed issue #6424: The size estimation formula for spark task is incorrect

2022-12-14 Thread GitBox
ahshahid closed issue #6424: The size estimation formula for spark task is incorrect URL: https://github.com/apache/iceberg/issues/6424

[GitHub] [iceberg] ddrinka commented on issue #2768: Support fan-out reads in PyIceberg

2022-12-14 Thread GitBox
ddrinka commented on issue #2768: URL: https://github.com/apache/iceberg/issues/2768#issuecomment-1352288435 > We could do this with Spark or Dask or Ray depending on what's installed on the system. Perhaps consider [Modin](https://github.com/modin-project/modin) as well?

[GitHub] [iceberg] stevenzwu commented on pull request #6426: Flink: add fixed field type for DataGenerators test util

2022-12-14 Thread GitBox
stevenzwu commented on PR #6426: URL: https://github.com/apache/iceberg/pull/6426#issuecomment-1352289861 @pvary regarding your other comments, I am not sure how to proceed yet. > Null values for null values, would optional primitive types at top level be enough? > Edge

[GitHub] [iceberg] yegangy0718 commented on a diff in pull request #6382: Implement ShuffleOperator to collect data statistics

2022-12-14 Thread GitBox
yegangy0718 commented on code in PR #6382: URL: https://github.com/apache/iceberg/pull/6382#discussion_r1048051789 ## flink/v1.16/flink/src/test/java/org/apache/iceberg/flink/sink/shuffle/TestShuffleOperator.java: ## @@ -0,0 +1,132 @@ +/* + * Licensed to the Apache Software Foun

[GitHub] [iceberg] flyrain merged pull request #6350: Spark 3.3: Time range query of changelog tables

2022-12-14 Thread GitBox
flyrain merged PR #6350: URL: https://github.com/apache/iceberg/pull/6350

[GitHub] [iceberg] github-actions[bot] commented on issue #5071: Support UNSET of sortOrder from the SQL

2022-12-14 Thread GitBox
github-actions[bot] commented on issue #5071: URL: https://github.com/apache/iceberg/issues/5071#issuecomment-1352388065 This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs.

[GitHub] [iceberg] github-actions[bot] commented on issue #4948: Create a Github Action to automatically mark issues as stale and later close if inactive

2022-12-14 Thread GitBox
github-actions[bot] commented on issue #4948: URL: https://github.com/apache/iceberg/issues/4948#issuecomment-1352388123 This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'

[GitHub] [iceberg] github-actions[bot] closed issue #4948: Create a Github Action to automatically mark issues as stale and later close if inactive

2022-12-14 Thread GitBox
github-actions[bot] closed issue #4948: Create a Github Action to automatically mark issues as stale and later close if inactive URL: https://github.com/apache/iceberg/issues/4948

[GitHub] [iceberg] dennishuo opened a new pull request, #6428: Add new SnowflakeCatalog implementation to enable directly using Snowflake-managed Iceberg tables

2022-12-14 Thread GitBox
dennishuo opened a new pull request, #6428: URL: https://github.com/apache/iceberg/pull/6428 This read-only implementation of the Catalog interface, initially built on top of the [Snowflake JDBC driver](https://docs.snowflake.com/en/user-guide/jdbc.html) for the connection layer, enables e

[GitHub] [iceberg] ahshahid commented on issue #6424: The size estimation formula for spark task is incorrect

2022-12-14 Thread GitBox
ahshahid commented on issue #6424: URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1352519479 @RussellSpitzer, actually part of the issue I was seeing was related to a scannedFraction approximately equal to 1 but a record count of 1, which was resulting in net rows seen
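
A narrowing cast from `double` to `long` in Java truncates toward zero, which is one way a row estimate can lose a row when the fraction sits just under 1. The snippet below is a minimal sketch of that general language behavior only; the class name `TruncationDemo` and the value 0.999 are assumptions for illustration, not figures taken from the issue.

```java
// Hypothetical demo of double-to-long narrowing; not code from Iceberg.
class TruncationDemo {

    // The cast discards the fractional part, so anything below 1.0 becomes 0.
    static long truncated(double fraction, long recordCount) {
        return (long) (fraction * recordCount);
    }

    // Math.round is shown only for contrast; it rounds to the nearest long.
    static long rounded(double fraction, long recordCount) {
        return Math.round(fraction * recordCount);
    }

    public static void main(String[] args) {
        // Assumed values: a fraction just under 1 and a file with a single record.
        System.out.println(truncated(0.999, 1)); // prints 0
        System.out.println(rounded(0.999, 1));   // prints 1
    }
}
```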

[GitHub] [iceberg] ahshahid commented on issue #6424: The size estimation formula for spark task is incorrect

2022-12-14 Thread GitBox
ahshahid commented on issue #6424: URL: https://github.com/apache/iceberg/issues/6424#issuecomment-1352520299 The other issue which I was looking at was comparing perf of parquet with iceberg and in that it seems, that iceberg because of better size estimation as compared to parquet, resul

[GitHub] [iceberg] xwmr-max commented on pull request #6412: Doc: Modify some options refer to Read-options in flink streaming rea…

2022-12-14 Thread GitBox
xwmr-max commented on PR #6412: URL: https://github.com/apache/iceberg/pull/6412#issuecomment-1352605298 @openinx

[GitHub] [iceberg] pvary commented on a diff in pull request #6382: Implement ShuffleOperator to collect data statistics

2022-12-14 Thread GitBox
pvary commented on code in PR #6382: URL: https://github.com/apache/iceberg/pull/6382#discussion_r1049262265 ## flink/v1.16/flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/ShuffleOperator.java: ## @@ -0,0 +1,140 @@ +/* + * Licensed to the Apache Software Foundation (AS

[GitHub] [iceberg] jiamin13579 commented on pull request #6419: Doc:Example of correcting the document add/drop partition truncate

2022-12-14 Thread GitBox
jiamin13579 commented on PR #6419: URL: https://github.com/apache/iceberg/pull/6419#issuecomment-1352635716 > Not sure its necessary, looks like for now width can be any of the arguments: https://github.com/apache/iceberg/blob/master/spark/v3.3/spark/src/main/java/org/apache/iceberg/spark/S

[GitHub] [iceberg] Fokko commented on a diff in pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

2022-12-14 Thread GitBox
Fokko commented on code in PR #6392: URL: https://github.com/apache/iceberg/pull/6392#discussion_r1049285936 ## python/Makefile: ## @@ -26,14 +26,21 @@ lint: poetry run pre-commit run --all-files test: - poetry run coverage run --source=pyiceberg/ -m pytest test

[GitHub] [iceberg] nastra commented on a diff in pull request #6428: Add new SnowflakeCatalog implementation to enable directly using Snowflake-managed Iceberg tables

2022-12-14 Thread GitBox
nastra commented on code in PR #6428: URL: https://github.com/apache/iceberg/pull/6428#discussion_r1049312756 ## snowflake/src/test/java/org/apache/iceberg/snowflake/SnowflakeCatalogTest.java: ## @@ -0,0 +1,231 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under on