Re: [PR] Spec: Clarify which columns can be used for equality delete files. [iceberg]

2023-11-08 Thread via GitHub


gaborkaszab commented on code in PR #8981:
URL: https://github.com/apache/iceberg/pull/8981#discussion_r1386184942


##
format/spec.md:
##
@@ -842,7 +842,8 @@ The rows in the delete file must be sorted by `file_path` 
then `pos` to optimize
 
 Equality delete files identify deleted rows in a collection of data files by 
one or more column values, and may optionally contain additional columns of the 
deleted row.
 
-Equality delete files store any subset of a table's columns and use the 
table's field ids. The _delete columns_ are the columns of the delete file used 
to match data rows. Delete columns are identified by id in the delete file 
[metadata column `equality_ids`](#manifests). Float and double columns cannot 
be used as delete columns in equality delete files.
+Equality delete files store any subset of a table's columns and use the 
table's field ids. The _delete columns_ are the columns of the delete file used 
to match data rows. Delete columns are identified by id in the delete file 
[metadata column `equality_ids`](#manifests). The column restrictions for 
columns used in equality delete files are the same as those for [identifier 
fields](#identifier-field-ids) with the exception that optional columns and 
columns nested under optional structs are allowed (if a 
+parent struct column is null it implies the leaf column is null).

Review Comment:
   I have one implementation-related objection against treating `NULL`s in the delete files as `IS NULL`s:
   This might differ per query engine, but in general I think the predicates for the query are created at planning time, while the information that a row in the delete file contains `NULL` only becomes known during the execution phase, when you read the actual rows. At that point you won't be exchanging an equality predicate for an `IS NULL` predicate, right?
   Do you know if query engines are already able to get around this problem and treat `NULL` values in the delete files as `IS NULL`?
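   A minimal sketch of the null-safe matching an engine would need at execution time (hypothetical helper, not an Iceberg API; rows are modeled as plain dicts keyed by field id):
   ```python
   # Hypothetical helper, not an Iceberg API: null-safe row matching with
   # IS NULL semantics. SQL's `=` never matches NULL, so a NULL value in the
   # delete file has to be matched as IS NULL against the data row.
   def equality_delete_matches(delete_row, data_row, delete_col_ids):
       for col in delete_col_ids:
           dv, rv = delete_row.get(col), data_row.get(col)
           if dv is None:
               if rv is not None:  # NULL matches only NULL
                   return False
           elif dv != rv:
               return False
       return True

   # A delete row of {1: None} deletes data rows whose field 1 is NULL,
   # including rows where a parent optional struct is NULL.
   assert equality_delete_matches({1: None}, {1: None}, [1])
   assert not equality_delete_matches({1: None}, {1: "a"}, [1])
   ```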






Re: [PR] Shift site build to use monorepo and gh-pages [iceberg]

2023-11-08 Thread via GitHub


Fokko commented on code in PR #8919:
URL: https://github.com/apache/iceberg/pull/8919#discussion_r1386152391


##
site/dev/common.sh:
##
@@ -0,0 +1,125 @@
+#!/usr/bin/env bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+set -e
+
+mv_nightly_up () {
+  mv docs/docs/nightly/ docs/
+}
+
+mv_nightly_down () {
+  mv docs/nightly/ docs/docs/
+}
+
+uninstall_deps () {
+  if test -d .env; then
+    deactivate
+    rm -r .env
+  fi
+}
+
+install_deps () {
+  uninstall_deps
+  python -m venv .env

Review Comment:
   This already assumes that you have `virtualenv` installed. Do we actually 
need a `venv` for this simple setup? We could also use the system `python3`



##
site/README.md:
##
@@ -54,18 +54,34 @@ The non-versioned site pages are all the `/site/docs/.*md` 
files and the docs ar
 │   ├── about.md
 │   ├── ...
 │   └── view-spec.md
-├── README.md
+├── ...
+├── Makefile
 ├── mkdocs.yml (non-versioned)
-├── requirements.txt
-└── variables.yml
+└── requirements.txt
 ```
 ### Building the versioned docs
 
-> [!IMPORTANT]  
-> This build process is currently missing older versions and the javadoc 
branches.
-> Until these branches are merged, these steps will not work.
The Iceberg versioned docs are committed in the orphan `docs` branch and mounted using [git worktree](https://git-scm.com/docs/git-worktree) at build time. The `docs` branch contains the versioned documentation source files at the root. These versions exist at the `/site/docs/docs/` directory once mounted. The `latest` version is a soft link to the most recent build version in the `docs` branch. There is also a `javadoc` branch that contains all prior statically generated versions of the javadocs, mounted at `/site/docs/javadoc/`.
 
-All previously versioned docs will be committed in `docs-` branches 
and mounted using [git worktree](https://git-scm.com/docs/git-worktree) at 
build time. The worktree will pull these versions in following the 
`/site/docs/docs/` convention. The `latest` version, will be a 
secondary copy of the most recent build version in the worktree, but pointing 
to `/site/docs/docs/latest`. There is also a `javadoc` branch that contains all 
prior static generation versions of the javadocs in a single tag.
The docs are built, run, and released using [make](https://www.gnu.org/software/make/manual/make.html). The [Makefile](Makefile) and the [common shell script](dev/common.sh) support the following commands:
+
+```
+site > make help
+[build](dev/build.sh): Clean and build the site locally.
+[clean](dev/clean.sh): Clean the local site.
+[deploy](dev/deploy.sh): Clean, build, and deploy the Iceberg docs site.
+help: Show help for each of the Makefile recipes.
+[release](dev/release.sh): Release the current nightly docs as ICEBERG_VERSION 
(make release ICEBERG_VERSION=).
+[serve](dev/serve.sh): Clean, build, and run the site locally.
+```

Review Comment:
   The links don't render inside of the code block:
   
![image](https://github.com/apache/iceberg/assets/1134248/5613b33a-b25c-474d-b686-402337343aea)
   
   ```suggestion
   - [build](dev/build.sh): Clean and build the site locally.
   - [clean](dev/clean.sh): Clean the local site.
   - [deploy](dev/deploy.sh): Clean, build, and deploy the Iceberg docs site.
   - [release](dev/release.sh): Release the current nightly docs as 
ICEBERG_VERSION (make release ICEBERG_VERSION=).
   - [serve](dev/serve.sh): Clean, build, and run the site locally.
   ```



Re: [PR] Shift site build to use monorepo and gh-pages [iceberg]

2023-11-08 Thread via GitHub


bitsondatadev commented on code in PR #8919:
URL: https://github.com/apache/iceberg/pull/8919#discussion_r1386194229


##
site/.gitignore:
##
@@ -1,13 +1,3 @@
-## Temp remove for first phase

Review Comment:
   Yeah, fair enough...I added a few specifics from my previous one






Re: [I] [Feature Request] Implement `equals` for `RESTMessage` [iceberg]

2023-11-08 Thread via GitHub


Fokko commented on issue #9003:
URL: https://github.com/apache/iceberg/issues/9003#issuecomment-1801334447

   @liurenjie1024 No problem, and I think it makes sense to be able to check those for equality.





Re: [PR] Shift site build to use monorepo and gh-pages [iceberg]

2023-11-08 Thread via GitHub


bitsondatadev commented on code in PR #8919:
URL: https://github.com/apache/iceberg/pull/8919#discussion_r1386208436


##
site/dev/common.sh:
##
@@ -0,0 +1,125 @@
+#!/usr/bin/env bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+set -e
+
+mv_nightly_up () {
+  mv docs/docs/nightly/ docs/
+}
+
+mv_nightly_down () {
+  mv docs/nightly/ docs/docs/
+}
+
+uninstall_deps () {
+  if test -d .env; then
+    deactivate
+    rm -r .env
+  fi
+}
+
+install_deps () {
+  uninstall_deps
+  python -m venv .env
+  source .env/bin/activate
+  pip install -r requirements.txt
+}
+
+assert_not_empty () {
+  if [ -z "$1" ]
+  then
+    echo "No argument supplied"
+    exit 1
+  fi
+}
+
+get_latest_version () {
+  basename $(ls -d docs/docs/*/ | sort -V | tail -1)
+}
+
+update_version () {
+  local ICEBERG_VERSION="$1"
+  assert_not_empty "${ICEBERG_VERSION}"
+
+  sed -i '' -E "s/(^site\_name:[[:space:]]+docs\/).*$/\1${ICEBERG_VERSION}/" 
${ICEBERG_VERSION}/mkdocs.yml
+  sed -i '' -E 
"s/(^[[:space:]]*-[[:space:]]+Javadoc:.*\/javadoc\/).*$/\1${ICEBERG_VERSION}/" 
${ICEBERG_VERSION}/mkdocs.yml
+
+  sed -i '' -E "s/(^\- latest: '\!include 
docs\/docs\/latest\/mkdocs\.yml'/a\- ${ICEBERG_VERSION}: '\!include 
docs\/docs\/${ICEBERG_VERSION}\/mkdocs\.yml/" ../../mkdocs.yml
+}
+
+create_latest () {
+  local ICEBERG_VERSION="$1"
+  assert_not_empty "${ICEBERG_VERSION}"
+
+  rm -rf docs/docs/latest/
+  mkdir docs/docs/latest/
+
+  ln -s "../${ICEBERG_VERSION}/docs" docs/docs/latest/docs
+  cp "docs/docs/${ICEBERG_VERSION}/mkdocs.yml" docs/docs/latest/
+
+  cd docs/docs/
+  update_version "latest"
+  cd -
+}
+
+
+# 
https://squidfunk.github.io/mkdocs-material/setup/setting-up-site-search/#search-exclusion
+search_exclude_versioned_docs () {
+  local ICEBERG_VERSION="$1"
+  assert_not_empty "${ICEBERG_VERSION}"
+
+  cd "${ICEBERG_VERSION}/docs/"
+  
+  python3 -c "import os

Review Comment:
   Yeesh, yeah... I think best practice is `python3`, no?
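   For reference, a readable expansion of the inline `python3 -c` snippet above (a sketch assuming the first two lines of every page are front matter that must stay at the top):
   ```python
   # Add MkDocs Material search exclusion to every markdown page in the
   # current directory, keeping the two front-matter lines at the top.
   import os

   for f in filter(lambda x: x.endswith('.md'), os.listdir()):
       lines = open(f).readlines()
       open(f, 'w').writelines(
           lines[:2] + ['search:\n', '  exclude: true\n'] + lines[2:]
       )
   ```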






Re: [I] [Feature Request] Implement `equals` for `RESTMessage` [iceberg]

2023-11-08 Thread via GitHub


liurenjie1024 commented on issue #9003:
URL: https://github.com/apache/iceberg/issues/9003#issuecomment-1801337442

   Cool, I will take this.





Re: [PR] Shift site build to use monorepo and gh-pages [iceberg]

2023-11-08 Thread via GitHub


bitsondatadev commented on code in PR #8919:
URL: https://github.com/apache/iceberg/pull/8919#discussion_r1386210561


##
site/Makefile:
##
@@ -0,0 +1,38 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+.PHONY: help
+help: # Show help for each of the Makefile recipes.
+   @grep -E '^[a-zA-Z0-9 -]+:.*#'  Makefile | sort | while read -r l; do 
printf "\033[1;32m$$(echo $$l | cut -f 1 -d':')\033[00m:$$(echo $$l | cut -f 2- 
-d'#')\n"; done
+
+.PHONY: serve
+serve: # Clean, build, and run the docs site locally.
+   dev/serve.sh
+
+.PHONY: build
+build: # Clean and build the docs site locally.

Review Comment:
   lol, actually, the comments were a hack for `make help` or default `make` 
output.






Re: [PR] Shift site build to use monorepo and gh-pages [iceberg]

2023-11-08 Thread via GitHub


bitsondatadev commented on code in PR #8919:
URL: https://github.com/apache/iceberg/pull/8919#discussion_r1386213158


##
site/README.md:
##
@@ -83,59 +99,27 @@ All previously versioned docs will be committed in 
`docs-` branches and
         └── ...
 ```
 
-### Install
-
-1. (Optional) Set up venv
To run this, run the `serve` recipe, which runs the `build` recipe and calls `mkdocs serve`. This will run locally at `http://localhost:8000`.
 ```
-python -m venv mkdocs_env
-source mkdocs_env/bin/activate
+make serve
 ```
 
-1. Install required Python libraries
+To clear all build files, run `clean`.
 ```
-pip install -r requirements.txt
-```
-
- Adding additional versioned documentation
-
-To build locally with additional docs versions, add them to your working tree.
-For now, I'm just adding a single version, and the javadocs directory.
-
-```
-git worktree add site/docs/docs/1.4.0 docs-1.4.0
-git worktree add site/docs/javadoc javadoc
-```
-
-## Build
-
-Run the build command in the root directory, and optionally add `--clean` to 
force MkDocs to clear previously generated pages.
-
-```
-mkdocs build [--clean]
-```
-
-## Run
-
-Start MkDocs server locally to verify the site looks good.
-
-```
-mkdocs serve
+make clean
 ```
 
 ## Release process
 
-Deploying a version of the docs is a two step process:
- 1. ~~Cut a new release from the current branch revision. This creates a new 
branch `docs-`.~~
-
+Deploying the docs is a two step process:
+ 1. Release a new version by copying the current nightly directory to a new 
version directory in the `docs` branch.
 ```
-.github/bin/deploy_docs.sh -v 1.4.0
+make release ICEBERG_VERSION=${ICEBERG_VERSION}
+```
+ 1. Push the generated site to `asf-site`.
+```
+make deploy 

Review Comment:
   No, I guess I should remove this, as that basically documents the step. I'll only keep `build` and `serve` documented in the README.
   
   Although one last thought: I just want all of this documented so that others can read through it and understand my intent.






Re: [PR] Shift site build to use monorepo and gh-pages [iceberg]

2023-11-08 Thread via GitHub


Fokko commented on code in PR #8919:
URL: https://github.com/apache/iceberg/pull/8919#discussion_r1386219088


##
site/dev/common.sh:
##
@@ -0,0 +1,125 @@
+#!/usr/bin/env bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+set -e
+
+mv_nightly_up () {
+  mv docs/docs/nightly/ docs/
+}
+
+mv_nightly_down () {
+  mv docs/nightly/ docs/docs/
+}
+
+uninstall_deps () {
+  if test -d .env; then
+    deactivate
+    rm -r .env
+  fi
+}
+
+install_deps () {
+  uninstall_deps
+  python -m venv .env
+  source .env/bin/activate
+  pip install -r requirements.txt
+}
+
+assert_not_empty () {
+  if [ -z "$1" ]
+  then
+    echo "No argument supplied"
+    exit 1
+  fi
+}
+
+get_latest_version () {
+  basename $(ls -d docs/docs/*/ | sort -V | tail -1)
+}
+
+update_version () {
+  local ICEBERG_VERSION="$1"
+  assert_not_empty "${ICEBERG_VERSION}"
+
+  sed -i '' -E "s/(^site\_name:[[:space:]]+docs\/).*$/\1${ICEBERG_VERSION}/" 
${ICEBERG_VERSION}/mkdocs.yml
+  sed -i '' -E 
"s/(^[[:space:]]*-[[:space:]]+Javadoc:.*\/javadoc\/).*$/\1${ICEBERG_VERSION}/" 
${ICEBERG_VERSION}/mkdocs.yml
+
+  sed -i '' -E "s/(^\- latest: '\!include 
docs\/docs\/latest\/mkdocs\.yml'/a\- ${ICEBERG_VERSION}: '\!include 
docs\/docs\/${ICEBERG_VERSION}\/mkdocs\.yml/" ../../mkdocs.yml
+}
+
+create_latest () {
+  local ICEBERG_VERSION="$1"
+  assert_not_empty "${ICEBERG_VERSION}"
+
+  rm -rf docs/docs/latest/
+  mkdir docs/docs/latest/
+
+  ln -s "../${ICEBERG_VERSION}/docs" docs/docs/latest/docs
+  cp "docs/docs/${ICEBERG_VERSION}/mkdocs.yml" docs/docs/latest/
+
+  cd docs/docs/
+  update_version "latest"
+  cd -
+}
+
+
+# 
https://squidfunk.github.io/mkdocs-material/setup/setting-up-site-search/#search-exclusion
+search_exclude_versioned_docs () {
+  local ICEBERG_VERSION="$1"
+  assert_not_empty "${ICEBERG_VERSION}"
+
+  cd "${ICEBERG_VERSION}/docs/"
+  
+  python3 -c "import os

Review Comment:
   Yes, I prefer `python3`






Re: [I] [Feature Request] Implement `equals` for `RESTMessage` [iceberg]

2023-11-08 Thread via GitHub


liurenjie1024 commented on issue #9003:
URL: https://github.com/apache/iceberg/issues/9003#issuecomment-1801347467

   cc @Fokko, please assign this to me.





Re: [PR] Spec: Clarify which columns can be used for equality delete files. [iceberg]

2023-11-08 Thread via GitHub


liurenjie1024 commented on code in PR #8981:
URL: https://github.com/apache/iceberg/pull/8981#discussion_r138616


##
format/spec.md:
##
@@ -842,7 +842,8 @@ The rows in the delete file must be sorted by `file_path` 
then `pos` to optimize
 
 Equality delete files identify deleted rows in a collection of data files by 
one or more column values, and may optionally contain additional columns of the 
deleted row.
 
-Equality delete files store any subset of a table's columns and use the 
table's field ids. The _delete columns_ are the columns of the delete file used 
to match data rows. Delete columns are identified by id in the delete file 
[metadata column `equality_ids`](#manifests). Float and double columns cannot 
be used as delete columns in equality delete files.
+Equality delete files store any subset of a table's columns and use the 
table's field ids. The _delete columns_ are the columns of the delete file used 
to match data rows. Delete columns are identified by id in the delete file 
[metadata column `equality_ids`](#manifests). The column restrictions for 
columns used in equality delete files are the same as those for [identifier 
fields](#identifier-field-ids) with the exception that optional columns and 
columns nested under optional structs are allowed (if a 
+parent struct column is null it implies the leaf column is null).

Review Comment:
   I think this is guaranteed if all query engines use the `DeleteFilter` provided in Iceberg.






Re: [PR] feat: Implement load table api. [iceberg-rust]

2023-11-08 Thread via GitHub


Fokko commented on code in PR #89:
URL: https://github.com/apache/iceberg-rust/pull/89#discussion_r1386222676


##
crates/catalog/rest/src/catalog.rs:
##
@@ -312,11 +316,43 @@ impl Catalog for RestCatalog {
 }
 
 /// Load table from the catalog.
-async fn load_table(&self, _table: &TableIdent) -> Result<Table> {
-Err(Error::new(
-ErrorKind::FeatureUnsupported,
-"Creating table not supported yet!",
-))
+async fn load_table(&self, table: &TableIdent) -> Result<Table> {
+let request = self

Review Comment:
   For Python we split it out, because we're also doing some additional things: 
https://github.com/apache/iceberg-python/blob/main/pyiceberg/catalog/rest.py#L209-L240
 But we can refactor this later as well 👍 









Re: [PR] feat: Implement load table api. [iceberg-rust]

2023-11-08 Thread via GitHub


Fokko merged PR #89:
URL: https://github.com/apache/iceberg-rust/pull/89





Re: [PR] feat: Implement load table api. [iceberg-rust]

2023-11-08 Thread via GitHub


Fokko commented on code in PR #89:
URL: https://github.com/apache/iceberg-rust/pull/89#discussion_r1386224255


##
crates/iceberg/src/table.rs:
##
@@ -17,10 +17,33 @@
 
 //! Table API for Apache Iceberg
 
+use crate::io::FileIO;
 use crate::spec::TableMetadata;
+use crate::TableIdent;
+use typed_builder::TypedBuilder;
 
 /// Table represents a table in the catalog.
+#[derive(TypedBuilder)]
 pub struct Table {
-metadata_location: String,
+file_io: FileIO,
+#[builder(default, setter(strip_option))]
+metadata_location: Option<String>,
 metadata: TableMetadata,
+identifier: TableIdent,
+}
+
+impl Table {
+/// Returns table identifier.

Review Comment:
   In Python this makes a few things easier, but we can also defer that. Easier to add later on than to remove it :D






[PR] Deploy 1.4.2 to docs branch / fix 1.4.x docs for search [iceberg]

2023-11-08 Thread via GitHub


bitsondatadev opened a new pull request, #9005:
URL: https://github.com/apache/iceberg/pull/9005

   (no comment)





Re: [PR] Core: Add a constructor to StaticTableOperations [iceberg]

2023-11-08 Thread via GitHub


nastra merged PR #8996:
URL: https://github.com/apache/iceberg/pull/8996





Re: [I] DELETE fails with "java.lang.IllegalArgumentException: info must be ExtendedLogicalWriteInfo" [iceberg]

2023-11-08 Thread via GitHub


bknbkn commented on issue #8926:
URL: https://github.com/apache/iceberg/issues/8926#issuecomment-1801484253

   Maybe you can add `.config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")` and try again.
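   A minimal PySpark sketch of that configuration (the catalog name and warehouse path here are illustrative only, and the Iceberg Spark runtime jar must be on the classpath):
   ```python
   from pyspark.sql import SparkSession

   # Enable the Iceberg SQL extensions and register an illustrative catalog.
   spark = (
       SparkSession.builder
       .config("spark.sql.extensions",
               "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
       .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
       .config("spark.sql.catalog.demo.type", "hadoop")
       .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
       .getOrCreate()
   )
   ```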
   





Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2023-11-08 Thread via GitHub


nk1506 commented on PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#issuecomment-1801545652

   @pvary, my bad, I didn't understand the initial comments about loading all the table data from HMS. As you mentioned, the initial implementation was not filtering when constructing the `tableIdentifiers` for `listAllTables`. I have addressed these comments. I also followed the same error strategy as `InMemoryCatalog` for the Hive catalog.
   Regarding the code duplication (`HiveTableOperations`/`HiveViewOperations`), I am creating another PR that can go in before this one.
   





Re: [PR] Shift site build to use monorepo and gh-pages [iceberg]

2023-11-08 Thread via GitHub


bitsondatadev commented on code in PR #8919:
URL: https://github.com/apache/iceberg/pull/8919#discussion_r1386414038


##
site/dev/common.sh:
##
@@ -0,0 +1,125 @@
+#!/usr/bin/env bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+set -e
+
+mv_nightly_up () {
+  mv docs/docs/nightly/ docs/
+}
+
+mv_nightly_down () {
+  mv docs/nightly/ docs/docs/
+}
+
+uninstall_deps () {
+  if test -d site_env; then
+    deactivate
+    rm -r site_env
+  fi
+}
+
+install_deps () {
+  uninstall_deps
+  python -m venv site_env
+  source site_env/bin/activate
+  pip install -r requirements.txt
+}
+
+assert_not_empty () {
+  if [ -z "$1" ]
+  then
+    echo "No argument supplied"
+    exit 1
+  fi
+}
+
+get_latest_version () {
+  basename $(ls -d docs/docs/*/ | sort -V | tail -1)
+}
+
+update_version () {
+  local ICEBERG_VERSION="$1"
+  assert_not_empty "${ICEBERG_VERSION}"
+
+  sed -i '' -E "s/(^site\_name:[[:space:]]+docs\/).*$/\1${ICEBERG_VERSION}/" 
${ICEBERG_VERSION}/mkdocs.yml
+  sed -i '' -E 
"s/(^[[:space:]]*-[[:space:]]+Javadoc:.*\/javadoc\/).*$/\1${ICEBERG_VERSION}/" 
${ICEBERG_VERSION}/mkdocs.yml
+
+  sed -i '' -E "s/\- latest: '\!include docs\/docs\/latest\/mkdocs\.yml'/a 
   \- ${ICEBERG_VERSION}: '\!include 
docs\/docs\/${ICEBERG_VERSION}\/mkdocs\.yml/" ../../mkdocs.yml
+}
+
+create_latest () {
+  local ICEBERG_VERSION="$1"
+  assert_not_empty "${ICEBERG_VERSION}"
+
+  rm -rf docs/docs/latest/
+  mkdir docs/docs/latest/
+
+  ln -s "../${ICEBERG_VERSION}/docs" docs/docs/latest/docs
+  cp "docs/docs/${ICEBERG_VERSION}/mkdocs.yml" docs/docs/latest/
+
+  cd docs/docs/
+  update_version "latest"
+  cd -
+}
+
+
+# 
https://squidfunk.github.io/mkdocs-material/setup/setting-up-site-search/#search-exclusion
+search_exclude_versioned_docs () {
+  local ICEBERG_VERSION="$1"
+  assert_not_empty "${ICEBERG_VERSION}"
+
+  cd "${ICEBERG_VERSION}/docs/"
+  
+  python3 -c "import os
+for f in filter(lambda x: x.endswith('.md'), os.listdir()): lines = 
open(f).readlines(); open(f, 'w').writelines(lines[:2] + ['search:\n', '  
exclude: true\n'] + lines[2:]);"
+  
+  cd -
+}
+
+pull_versioned_docs () {
+  mv_nightly_up 
+  
+  git worktree add docs/docs docs
+  git worktree add docs/javadoc javadoc

Review Comment:
   Should be good now!






Re: [PR] Hive: Refactor TestHiveCatalog tests to use the core CatalogTests [iceberg]

2023-11-08 Thread via GitHub


nk1506 commented on code in PR #8918:
URL: https://github.com/apache/iceberg/pull/8918#discussion_r1386428086


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java:
##
@@ -261,6 +261,12 @@ public void renameTable(TableIdentifier from, 
TableIdentifier originalTo) {
 } catch (InterruptedException e) {
   Thread.currentThread().interrupt();
   throw new RuntimeException("Interrupted in call to rename", e);
+} catch (RuntimeException e) {

Review Comment:
   I think alterTable is not throwing `AlreadyExistsException`. As far as I can see, for both Hive2 and Hive3 it throws a [RuntimeException](https://github.com/apache/hive/blob/4df4d75bf1e16fe0af75aad0b4179c34c07fc975/standalone-metastore/src/main/java/org/apache/hadoop/hive/metastore/HiveAlterHandler.java#L140).






[I] table created by pyiceberg could not interoperate well with trino [iceberg-python]

2023-11-08 Thread via GitHub


zeddit opened a new issue, #134:
URL: https://github.com/apache/iceberg-python/issues/134

   ### Apache Iceberg version
   
   None
   
   ### Please describe the bug 🐞
   
   I am using hive metastore.
   
   When accessing a table created with PyIceberg using the statements below,
   ```
   schema = Schema(
       NestedField(field_id=1, name="date", field_type=DateType(), required=True)
   )

   catalog.create_table(
       identifier="test.table1",
       schema=schema,
       location='s3a://test/table1',
       properties={'write.format.default': 'PARQUET'}
   )
   ```
   Trino reports:
   ```
   $ describe iceberg.test.table1;
failed: Cannot parse missing long: current-snapshot-id
   ```
   When using `pyiceberg describe` to check the schema of the created table, it shows:
   ```
   pyiceberg describe sort_order_exp.csdate
   
   Table format version  2
   Metadata location     s3a://test/table1/metadata/0-e6826c66-a45b-453d-a8b2-c80132b117…
   Table UUID            ca9a2a8b-8737-4f96-aaaf-55ff11462225
   Last Updated          1699440150722
   Partition spec        []
   Sort order            []
   Current schema        Schema, id=0
                           └── 1: date: required date
   Current snapshot      None
   Snapshots             Snapshots
   Properties            write.format.default  PARQUET
   while tables created in Trino all have at least one snapshot and the current snapshot is set.
   
   PyIceberg is not able to set and update the current snapshot, so tables created by PyIceberg currently do not interoperate well with Trino.
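   A sketch of the parsing mismatch as I understand it (hypothetical reader code, not Trino's; the assumption is that a strict reader requires the field, while a writer can emit `-1` to mean "no current snapshot"):
   ```python
   import json

   def read_current_snapshot_id(metadata_json: str) -> int:
       # Strict reader: the field must be present, as in the Trino error above.
       meta = json.loads(metadata_json)
       if "current-snapshot-id" not in meta:
           raise ValueError("Cannot parse missing long: current-snapshot-id")
       return meta["current-snapshot-id"]

   read_current_snapshot_id('{"current-snapshot-id": -1}')  # parses: no current snapshot
   try:
       read_current_snapshot_id('{}')  # field omitted entirely
   except ValueError as err:
       print(err)  # the failure Trino reports above
   ```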
   





Re: [PR] Deploy 1.4.2 to docs branch / fix 1.4.x docs for search [iceberg]

2023-11-08 Thread via GitHub


Fokko merged PR #9005:
URL: https://github.com/apache/iceberg/pull/9005





Re: [I] hive integration iceberg related problems [iceberg]

2023-11-08 Thread via GitHub


pvary commented on issue #8993:
URL: https://github.com/apache/iceberg/issues/8993#issuecomment-1801696625

   > @pvary However, writing data does have problems. The data directory is generated normally, and the data file is also generated normally, but the metadata directory only has the metadata.json file and version-hint.text file
   
   So if I understand correctly, this confirms that the issue is with the write, and not the read.
   Am I right?





Re: [PR] feat: support ser/deser of value [iceberg-rust]

2023-11-08 Thread via GitHub


liurenjie1024 commented on PR #82:
URL: https://github.com/apache/iceberg-rust/pull/82#issuecomment-1801698341

   cc @Xuanwo @Fokko Any other comments?





Re: [I] Flink: Add support for Flink 1.18 [iceberg]

2023-11-08 Thread via GitHub


pvary commented on issue #8930:
URL: https://github.com/apache/iceberg/issues/8930#issuecomment-1801714514

   I have 2 PRs in progress which I would like to merge first (#8803 and #8553). If nobody starts working on this before they are merged, then I will move forward with this one. I plan to use the original method of upgrading (an independent path for every Flink version). I would like to see 1.18 support available before the end of the year.






Re: [PR] Docs: Add note that snapshot expiration and cleanup orphan files could corrupt Flink job state [iceberg]

2023-11-08 Thread via GitHub


pvary merged PR #9002:
URL: https://github.com/apache/iceberg/pull/9002





Re: [PR] Docs: Add note that snapshot expiration and cleanup orphan files could corrupt Flink job state [iceberg]

2023-11-08 Thread via GitHub


pvary commented on PR #9002:
URL: https://github.com/apache/iceberg/pull/9002#issuecomment-1801718060

   Thanks @lirui-apache!
   Merged





Re: [PR] Hive: Refactor TestHiveCatalog tests to use the core CatalogTests [iceberg]

2023-11-08 Thread via GitHub


pvary commented on code in PR #8918:
URL: https://github.com/apache/iceberg/pull/8918#discussion_r1386492127


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveCatalog.java:
##
@@ -261,6 +261,12 @@ public void renameTable(TableIdentifier from, 
TableIdentifier originalTo) {
 } catch (InterruptedException e) {
   Thread.currentThread().interrupt();
   throw new RuntimeException("Interrupted in call to rename", e);
+} catch (RuntimeException e) {

Review Comment:
   I would be surprised if the HMS API would throw a RuntimeException.
   Some Iceberg code might convert it to a RuntimeException, but the HMS API usually throws MetastoreExceptions.






Re: [PR] Spec: Clarify which columns can be used for equality delete files. [iceberg]

2023-11-08 Thread via GitHub


gaborkaszab commented on code in PR #8981:
URL: https://github.com/apache/iceberg/pull/8981#discussion_r1386511343


##
format/spec.md:
##
@@ -842,7 +842,8 @@ The rows in the delete file must be sorted by `file_path` 
then `pos` to optimize
 
 Equality delete files identify deleted rows in a collection of data files by 
one or more column values, and may optionally contain additional columns of the 
deleted row.
 
-Equality delete files store any subset of a table's columns and use the 
table's field ids. The _delete columns_ are the columns of the delete file used 
to match data rows. Delete columns are identified by id in the delete file 
[metadata column `equality_ids`](#manifests). Float and double columns cannot 
be used as delete columns in equality delete files.
+Equality delete files store any subset of a table's columns and use the 
table's field ids. The _delete columns_ are the columns of the delete file used 
to match data rows. Delete columns are identified by id in the delete file 
[metadata column `equality_ids`](#manifests). The column restrictions for 
columns used in equality delete files are the same as those for [identifier 
fields](#identifier-field-ids) with the exception that optional columns and 
columns nested under optional structs are allowed (if a 
+parent struct column is null it implies the leaf column is null).

Review Comment:
   I don't think we can take it as guaranteed that all the engines use `DeleteFilter`. For instance, Impala doesn't.






Re: [PR] Shift site build to use monorepo and gh-pages [iceberg]

2023-11-08 Thread via GitHub


Fokko commented on code in PR #8919:
URL: https://github.com/apache/iceberg/pull/8919#discussion_r1386524118


##
site/dev/common.sh:
##
@@ -0,0 +1,122 @@
+#!/usr/bin/env bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+set -e
+
+mv_nightly_up () {
+  mv docs/docs/nightly/ docs/
+}
+
+mv_nightly_down () {
+  mv docs/nightly/ docs/docs/
+}
+
+install_deps () {
+  pip install -r requirements.txt --upgrade
+}
+
+assert_not_empty () {
+  if [ -z "$1" ]
+then
+  echo "No argument supplied"
+  exit 1
+  fi 
+}
+
+get_latest_version () {
+  basename $(ls -d docs/docs/*/ | sort -V | tail -1)
+}
+
+update_version () {
+  local ICEBERG_VERSION="$1"
+  assert_not_empty "${ICEBERG_VERSION}"
+
+  sed -i '' -E "s/(^site\_name:[[:space:]]+docs\/).*$/\1${ICEBERG_VERSION}/" 
${ICEBERG_VERSION}/mkdocs.yml
+  sed -i '' -E 
"s/(^[[:space:]]*-[[:space:]]+Javadoc:.*\/javadoc\/).*$/\1${ICEBERG_VERSION}/" 
${ICEBERG_VERSION}/mkdocs.yml
+
+  #sed -i '' -E "s/\- latest: '\!include 
docs\/docs\/latest\/mkdocs\.yml'/a\- ${ICEBERG_VERSION}: '\!include 
docs\/docs\/${ICEBERG_VERSION}\/mkdocs\.yml/" ../../mkdocs.yml
+}
+
+create_latest () {
+  local ICEBERG_VERSION="$1"
+  assert_not_empty "${ICEBERG_VERSION}"
+
+  rm -rf docs/docs/latest/
+  mkdir docs/docs/latest/
+
+  ln -s "../${ICEBERG_VERSION}/docs" docs/docs/latest/docs
+  cp "docs/docs/${ICEBERG_VERSION}/mkdocs.yml" docs/docs/latest/
+
+  cd docs/docs/
+  update_version "latest"
+  cd -
+}
+
+
+# 
https://squidfunk.github.io/mkdocs-material/setup/setting-up-site-search/#search-exclusion
+search_exclude_versioned_docs () {
+  local ICEBERG_VERSION="$1"
+  assert_not_empty "${ICEBERG_VERSION}"
+
+  cd "${ICEBERG_VERSION}/docs/"
+  
+  python3 -c "import os
+for f in filter(lambda x: x.endswith('.md'), os.listdir()): lines = 
open(f).readlines(); open(f, 'w').writelines(lines[:2] + ['search:\n', '  
exclude: true\n'] + lines[2:]);"
+  
+  cd -
+}
+
+pull_versioned_docs () {
+  mv_nightly_up 
+  
+  git worktree add docs/docs docs
+  git worktree add docs/javadoc javadoc

Review Comment:
   ```suggestion
 git remote add iceberg-docs https://github.com/apache/iceberg.git
  git fetch iceberg-docs
 git worktree add docs/docs iceberg-docs/docs
 git worktree add docs/javadoc iceberg-docs/javadoc
   ```






Re: [PR] Docs: Add note that snapshot expiration and cleanup orphan files could corrupt Flink job state [iceberg]

2023-11-08 Thread via GitHub


lirui-apache commented on PR #9002:
URL: https://github.com/apache/iceberg/pull/9002#issuecomment-1801779507

   Thanks @pvary !





Re: [PR] Core: Iceberg streaming streaming-skip-overwrite-snapshots SparkMicroBatchStream only skips over one file per trigger [iceberg]

2023-11-08 Thread via GitHub


cccs-jc commented on code in PR #8980:
URL: https://github.com/apache/iceberg/pull/8980#discussion_r1386580961


##
core/src/main/java/org/apache/iceberg/MicroBatches.java:
##
@@ -92,7 +92,7 @@ private static List> 
indexManifests(
 
 for (ManifestFile manifest : manifestFiles) {
   manifestIndexes.add(Pair.of(manifest, currentFileIndex));
-  currentFileIndex += manifest.addedFilesCount() + 
manifest.existingFilesCount();
+  currentFileIndex += manifest.addedFilesCount();

Review Comment:
   Ya, none of the snapshots have an `existing file count` value.
   
   When I look at production tables it's actually pretty rare to see that. I don't know what makes the `existing file count` be set.
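   A toy model of the indexing change in this diff (hypothetical names mirroring `indexManifests`): each manifest gets a starting file index, and the fix advances the index by added files only, instead of added plus existing:
   ```python
   # Toy model of MicroBatches.indexManifests: pair each manifest with the
   # running file index; advance by added files only (the fix in this PR).
   def index_manifests(manifests):
       indexes, current = [], 0
       for m in manifests:
           indexes.append((m["path"], current))
           current += m["added_files_count"]  # was: + m["existing_files_count"]
       return indexes

   print(index_manifests([
       {"path": "m1.avro", "added_files_count": 3},
       {"path": "m2.avro", "added_files_count": 1},
   ]))  # [('m1.avro', 0), ('m2.avro', 3)]
   ```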






Re: [PR] Spec: Clarify which columns can be used for equality delete files. [iceberg]

2023-11-08 Thread via GitHub


liurenjie1024 commented on code in PR #8981:
URL: https://github.com/apache/iceberg/pull/8981#discussion_r1386629666


##
format/spec.md:
##
@@ -842,7 +842,8 @@ The rows in the delete file must be sorted by `file_path` 
then `pos` to optimize
 
 Equality delete files identify deleted rows in a collection of data files by 
one or more column values, and may optionally contain additional columns of the 
deleted row.
 
-Equality delete files store any subset of a table's columns and use the 
table's field ids. The _delete columns_ are the columns of the delete file used 
to match data rows. Delete columns are identified by id in the delete file 
[metadata column `equality_ids`](#manifests). Float and double columns cannot 
be used as delete columns in equality delete files.
+Equality delete files store any subset of a table's columns and use the 
table's field ids. The _delete columns_ are the columns of the delete file used 
to match data rows. Delete columns are identified by id in the delete file 
[metadata column `equality_ids`](#manifests). The column restrictions for 
columns used in equality delete files are the same as those for [identifier 
fields](#identifier-field-ids) with the exception that optional columns and 
columns nested under optional structs are allowed (if a 
+parent struct column is null it implies the leaf column is null).

Review Comment:
   Yes, that's a problem of the different language implementations.






[I] Spark does not support time [iceberg]

2023-11-08 Thread via GitHub


tundraraj opened a new issue, #9006:
URL: https://github.com/apache/iceberg/issues/9006

   ### Feature Request / Improvement
   
   Spark does not support the `time` type, per https://iceberg.apache.org/docs/latest/spark-writes/#iceberg-type-to-spark-type.
   See this thread: https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1699395544460329
   
   
   ### Query engine
   
   Spark





Re: [PR] Spark 3.5: Fix Migrate procedure renaming issue for custom catalog [iceberg]

2023-11-08 Thread via GitHub


tomtongue commented on code in PR #8931:
URL: https://github.com/apache/iceberg/pull/8931#discussion_r1386791377


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/MigrateTableSparkAction.java:
##
@@ -108,6 +109,23 @@ public MigrateTableSparkAction backupTableName(String 
tableName) {
 return this;
   }
 
+  @Override
+  public MigrateTableSparkAction destCatalogName(String catalogName) {
+CatalogManager catalogManager = spark().sessionState().catalogManager();
+
+CatalogPlugin catalogPlugin;
+if (catalogManager.isCatalogRegistered(catalogName)) {
+  catalogPlugin = catalogManager.catalog(catalogName);
+} else {
+  LOG.warn(
+  "{} doesn't exist in SparkSession. " + "Fallback to current 
SparkSession catalog.",
+  catalogName);
+  catalogPlugin = catalogManager.currentCatalog();
+}
+this.destCatalog = checkDestinationCatalog(catalogPlugin);

Review Comment:
   Thanks for the review, @singhpk234. Yes, as you're saying, the Iceberg GlueCatalogImpl replicates only "partial" metadata in the rename. So if the source Spark/Hive table is partitioned, the restore process will fail as follows:
   
   ```
   23/11/06 09:54:03 INFO MigrateTableSparkAction: Generating Iceberg metadata 
for db.tbl in s3://bucket/path/tbl/metadata
   23/11/06 09:54:03 WARN BaseCatalogToHiveConverter: Hive Exception type not 
found for AccessDeniedException
   23/11/06 09:54:05 INFO ClientConfigurationFactory: Set initial getObject 
socket timeout to 2000 ms.
   23/11/06 09:54:06 INFO CodeGenerator: Code generated in 230.388332 ms
   23/11/06 09:54:06 INFO CodeGenerator: Code generated in 17.169875 ms
   23/11/06 09:54:06 INFO CodeGenerator: Code generated in 18.598328 ms
   23/11/06 09:54:07 ERROR MigrateTableSparkAction: Failed to perform the 
migration, aborting table creation and restoring the original table
   23/11/06 09:54:07 INFO MigrateTableSparkAction: Restoring db.tbl from 
db.tbl_backup
   23/11/06 09:54:08 INFO GlueCatalog: created rename destination table 
garbagedb.iceberg_migrate_w_year_partition
   23/11/06 09:54:08 INFO GlueCatalog: Successfully dropped table db.tbl_backup 
from Glue
   23/11/06 09:54:08 INFO GlueCatalog: Dropped table: db.tbl_backup
   23/11/06 09:54:08 INFO GlueCatalog: Successfully renamed table from 
db.tbl_backup to garbagedb.iceberg_migrate_w_year_partition
   Exception in thread "main" 
org.apache.iceberg.exceptions.ValidationException: Unable to get partition spec 
for table: `db`.`tbl_backup`
at 
org.apache.iceberg.spark.SparkExceptionUtil.toUncheckedException(SparkExceptionUtil.java:55)
at 
org.apache.iceberg.spark.SparkTableUtil.importSparkTable(SparkTableUtil.java:415)
at 
org.apache.iceberg.spark.SparkTableUtil.importSparkTable(SparkTableUtil.java:460)
   ...
   Caused by: org.apache.spark.sql.AnalysisException: 
org.apache.hadoop.hive.ql.metadata.HiveException: Table tbl_backup is not a 
partitioned table
at 
org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:133)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.doListPartitions(HiveExternalCatalog.scala:1308)
at 
org.apache.spark.sql.hive.HiveExternalCatalog.listPartitions(HiveExternalCatalog.scala:1302)
   ...
   Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Table 
tbl_backup is not a partitioned table
at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2676)
at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2709)
   ...
   ```
   
   This error was caused by the partition information being lost in the renamed table.
   
   So, as you say, the best way to lift the migrate restriction would be to support the rename operation in GlueHiveMetastoreClient.
   
   At least some people have tried to migrate their tables to Iceberg on a custom catalog such as the Glue Catalog, but the migrate query cannot be used because of the rename restriction. Let me think about a better way to resolve this issue; if there is none, I think we need to ask GlueHiveMetastoreClient to support the rename operation.






Re: [PR] Core: Add View support for REST catalog [iceberg]

2023-11-08 Thread via GitHub


nastra commented on code in PR #7913:
URL: https://github.com/apache/iceberg/pull/7913#discussion_r1386793375


##
open-api/rest-catalog-open-api.yaml:
##
@@ -1630,6 +1990,102 @@ components:
 metadata-log:
   $ref: '#/components/schemas/MetadataLog'
 
+SQLViewRepresentation:
+  type: object
+  required:
+- type
+- sql
+- dialect
+  properties:
+type:
+  type: string
+sql:
+  type: string
+dialect:
+  type: string
+
+ViewRepresentation:
+  oneOf:
+- $ref: '#/components/schemas/SQLViewRepresentation'
+
+ViewHistoryEntry:
+  type: object
+  required:
+- version-id
+- timestamp-ms
+  properties:
+version-id:
+  type: integer
+timestamp-ms:
+  type: integer
+  format: int64
+
+ViewVersion:
+  type: object
+  required:
+- version-id
+- timestamp-ms
+- schema-id
+- summary
+- representations
+- default-namespace
+  properties:
+version-id:
+  type: integer
+timestamp-ms:
+  type: integer
+  format: int64
+schema-id:
+  type: integer

Review Comment:
   this should be done now with the last commit






Re: [PR] Core: Add View support for REST catalog [iceberg]

2023-11-08 Thread via GitHub


nastra commented on code in PR #7913:
URL: https://github.com/apache/iceberg/pull/7913#discussion_r1386806741


##
core/src/main/java/org/apache/iceberg/view/ViewMetadata.java:
##
@@ -257,8 +258,8 @@ public Builder addVersion(ViewVersion version) {
   return this;
 }
 
-private int addVersionInternal(ViewVersion version) {
-  int newVersionId = reuseOrCreateNewViewVersionId(version);
+private int addVersionInternal(ViewVersion newVersion) {

Review Comment:
   I've renamed this parameter to make the code slightly easier to understand, 
because we have a few cases in this method where we re-assign either the schema 
id or the view version id







Re: [I] Support adding an additional `opType` column when creating a table [iceberg]

2023-11-08 Thread via GitHub


klion26 commented on issue #8973:
URL: https://github.com/apache/iceberg/issues/8973#issuecomment-1802132754

   @nastra thanks for the reply. Yes, it looks the same, but it seems the 
`change-data-capture` feature is a view created on an existing Iceberg table?
   
   Our customer wants some Iceberg tables to behave like `cow` (copy-on-write): 
the table should contain the whole raw data, including the deleted data 
(labeled with an additional `opType` column).
   
   Do you think adding such a feature to Iceberg tables is valid? Thanks.
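   For reference, the existing route looks roughly like this (a hedged PySpark 
sketch; `my_catalog` and `db.tbl` are placeholders, assuming the Iceberg Spark 
extensions are enabled):
   ```python
   # create_changelog_view exposes per-row change types (including deletes)
   # as a `_change_type` column on a view over an existing table.
   spark.sql("""
       CALL my_catalog.system.create_changelog_view(
           table => 'db.tbl',
           changelog_view => 'tbl_changes'
       )
   """)
   spark.sql("SELECT * FROM tbl_changes WHERE _change_type = 'DELETE'").show()
   ```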





Re: [PR] Spark 3.5: Fix Migrate procedure renaming issue for custom catalog [iceberg]

2023-11-08 Thread via GitHub


singhpk234 commented on code in PR #8931:
URL: https://github.com/apache/iceberg/pull/8931#discussion_r1386866648


##
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/MigrateTableSparkAction.java:
##
@@ -108,6 +109,23 @@ public MigrateTableSparkAction backupTableName(String 
tableName) {
 return this;
   }
 
+  @Override
+  public MigrateTableSparkAction destCatalogName(String catalogName) {
+CatalogManager catalogManager = spark().sessionState().catalogManager();
+
+CatalogPlugin catalogPlugin;
+if (catalogManager.isCatalogRegistered(catalogName)) {
+  catalogPlugin = catalogManager.catalog(catalogName);
+} else {
+  LOG.warn(
+  "{} doesn't exist in SparkSession. " + "Fallback to current 
SparkSession catalog.",
+  catalogName);
+  catalogPlugin = catalogManager.currentCatalog();
+}
+this.destCatalog = checkDestinationCatalog(catalogPlugin);

Review Comment:
   Sure, at this point having GlueHiveMetastoreClient support rename ops seems 
like a reasonable solution to me.






Re: [I] Default table properties not respected when using Spark DataFrame API [iceberg]

2023-11-08 Thread via GitHub


boushphong commented on issue #8265:
URL: https://github.com/apache/iceberg/issues/8265#issuecomment-1802311411

   Still reproducible on 1.4.2.
   Can I take this one? :bow:  @nastra 





[I] using pyiceberg with kerberized hive metastore [iceberg-python]

2023-11-08 Thread via GitHub


saidixith002 opened a new issue, #135:
URL: https://github.com/apache/iceberg-python/issues/135

   ### Question
   
   Hi,
   Can anyone share examples of using pyiceberg with a kerberized hive 
metastore?
   
   ```
   raise TTransportException(type=TTransportException.END_OF_FILE,
   thrift.transport.TTransport.TTransportException: TSocket read 0 bytes
   ```
   same question posted here - https://github.com/apache/iceberg/issues/6229 
   
   Thanks!





Re: [I] Default table properties not respected when using Spark DataFrame API [iceberg]

2023-11-08 Thread via GitHub


boushphong commented on issue #8265:
URL: https://github.com/apache/iceberg/issues/8265#issuecomment-1802356534

   Actually, I don't think this is a bug. This would produce DDL like the 
Spark SQL API does:
   ```scala
   df.write.partitionBy("vendor_id").saveAsTable("local.nyc.taxis_df")
   ```
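   For completeness, a hedged PySpark sketch of setting properties explicitly 
at create time via the DataFrameWriterV2 API (the table name and property key 
below are only examples):
   ```python
   from pyspark.sql.functions import col

   (
       df.writeTo("local.nyc.taxis_df")
         .partitionedBy(col("vendor_id"))
         .tableProperty("write.format.default", "parquet")  # example property
         .create()
   )
   ```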





Re: [I] to_pandas() API which converts iceberg table scan to a pd.DataFrame will lost datetime data type and row order [iceberg-python]

2023-11-08 Thread via GitHub


zeddit commented on issue #132:
URL: https://github.com/apache/iceberg-python/issues/132#issuecomment-1802370780

   @rdblue it's great to find that `pyiceberg` loads data in a 
**deterministic** way; as Fokko said, the manifests are read sequentially.
   Thus, if we control how the data is written, we can control the order of 
the final dataframe.
   
   I have run a comprehensive experiment on row order using different table 
schemas (e.g. partitioned and sorted) and different ways of writing data. I 
will post my report here later.





Re: [I] DELETE fails with "java.lang.IllegalArgumentException: info must be ExtendedLogicalWriteInfo" [iceberg]

2023-11-08 Thread via GitHub


rafoid commented on issue #8926:
URL: https://github.com/apache/iceberg/issues/8926#issuecomment-1802375058

   When adding
   ```
   --conf 
spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
 \
   ```
   in addition to the set of configs listed above, I can't even start 
`spark-sql`:
   ```
   Exception in thread "main" java.lang.NoClassDefFoundError: 
org/apache/spark/sql/catalyst/expressions/AnsiCast
at 
org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions.$anonfun$apply$6(IcebergSparkSessionExtensions.scala:54)
at 
org.apache.spark.sql.SparkSessionExtensions.$anonfun$buildResolutionRules$1(SparkSessionExtensions.scala:175)
at 
scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
at 
scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at 
scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.map(TraversableLike.scala:286)
at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at 
org.apache.spark.sql.SparkSessionExtensions.buildResolutionRules(SparkSessionExtensions.scala:175)
at 
org.apache.spark.sql.internal.BaseSessionStateBuilder.customResolutionRules(BaseSessionStateBuilder.scala:219)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder$$anon$1.(HiveSessionStateBuilder.scala:95)
at 
org.apache.spark.sql.hive.HiveSessionStateBuilder.analyzer(HiveSessionStateBuilder.scala:85)
at 
org.apache.spark.sql.internal.BaseSessionStateBuilder.$anonfun$build$2(BaseSessionStateBuilder.scala:369)
at 
org.apache.spark.sql.internal.SessionState.analyzer$lzycompute(SessionState.scala:89)
at 
org.apache.spark.sql.internal.SessionState.analyzer(SessionState.scala:89)
at 
org.apache.spark.sql.internal.SessionState.catalogManager(SessionState.scala:99)
at 
org.apache.spark.sql.internal.CatalogImpl.currentDatabase(CatalogImpl.scala:65)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.currentDB$1(SparkSQLCLIDriver.scala:285)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.promptWithCurrentDB$1(SparkSQLCLIDriver.scala:292)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:296)
at 
org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
   ...
   ```
   Without it, at least I can query tables and get results. With it, I get the 
exception above on startup.





Re: [PR] minor: Provide Debug impl for pub structs #73 [iceberg-rust]

2023-11-08 Thread via GitHub


DeaconDesperado commented on PR #92:
URL: https://github.com/apache/iceberg-rust/pull/92#issuecomment-1802389582

   Rebased and resolved conflicts.





Re: [PR] Spec: Clarify which columns can be used for equality delete files. [iceberg]

2023-11-08 Thread via GitHub


emkornfield commented on code in PR #8981:
URL: https://github.com/apache/iceberg/pull/8981#discussion_r1387018427


##
format/spec.md:
##
@@ -842,7 +842,8 @@ The rows in the delete file must be sorted by `file_path` 
then `pos` to optimize
 
 Equality delete files identify deleted rows in a collection of data files by 
one or more column values, and may optionally contain additional columns of the 
deleted row.
 
-Equality delete files store any subset of a table's columns and use the 
table's field ids. The _delete columns_ are the columns of the delete file used 
to match data rows. Delete columns are identified by id in the delete file 
[metadata column `equality_ids`](#manifests). Float and double columns cannot 
be used as delete columns in equality delete files.
+Equality delete files store any subset of a table's columns and use the 
table's field ids. The _delete columns_ are the columns of the delete file used 
to match data rows. Delete columns are identified by id in the delete file 
[metadata column `equality_ids`](#manifests). The column restrictions for 
columns used in equality delete files are the same as those for [identifier 
fields](#identifier-field-ids) with the exception that optional columns and 
columns nested under optional structs are allowed (if a 
+parent struct column is null it implies the leaf column is null).

Review Comment:
   @gaborkaszab I think the concern over nulls should be brought to a different 
forum, as it is already part of the spec unambiguously.
   
   I think the implication for predicates, if you are generating them as part 
of a query plan, is that for nullable columns they need to be written as 
`(delete_file_column is not null && data_file_column = delete_file_column) || 
(delete_file_column is null && data_file_column is null)` to match rows that 
should be deleted.
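   For illustration, a minimal Python sketch of that null-aware matching rule 
(plain dicts stand in for rows here; this is not a real engine API):
   ```python
   def matches_equality_delete(data_row, delete_row, delete_cols):
       """True if data_row is matched (deleted) by delete_row; null matches null."""
       for col in delete_cols:
           d, e = data_row[col], delete_row[col]
           if (d is None) != (e is None):  # one side null, the other not
               return False
           if d is not None and d != e:  # both non-null but unequal
               return False
       return True

   # A null in the delete file only matches a null in the data file:
   assert matches_equality_delete({"a": None}, {"a": None}, ["a"])
   assert not matches_equality_delete({"a": 1}, {"a": None}, ["a"])
   ```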






Re: [PR] API, Core: implement types timestamp_ns and timestamptz_ns [iceberg]

2023-11-08 Thread via GitHub


jacobmarble commented on PR #8971:
URL: https://github.com/apache/iceberg/pull/8971#issuecomment-1802450224

   > @jacobmarble can you break this into smaller commits? There are a ton of 
files changed here and I'm concerned about catching problems with such a large 
PR.
   
   > I agree with @rdblue
   > 
   > For partition stat we had a PR dedicated for spec. As it's not necessary 
the same reviewers, I think it's good to have one PR for spec (even if it's 
very small) and PRs for api and impl.
   
   Happy to split this up. I tried to scope this to API and Core, but the 
cascade of failing tests led to this.





Re: [PR] API, Core: implement types timestamp_ns and timestamptz_ns [iceberg]

2023-11-08 Thread via GitHub


jacobmarble closed pull request #8971: API, Core: implement types timestamp_ns 
and timestamptz_ns
URL: https://github.com/apache/iceberg/pull/8971





[PR] API: implement types timestamp_ns and timestamptz_ns [iceberg]

2023-11-08 Thread via GitHub


jacobmarble opened a new pull request, #9008:
URL: https://github.com/apache/iceberg/pull/9008

   Helps #8657
   
   This change adds field `ChronoUnit unit` to `TimestampType`, such that 
`TimestampType` now represents four specified types:
   - `timestamp` (existing)
   - `timestamptz` (existing)
   - `timestamp_ns` (new #8683)
   - `timestamptz_ns` (new #8683)
   
   Note that TimestampType.with[out]Zone() are marked as deprecated in this 
change. In future PRs, I'll remove usage of these static methods.
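   For illustration, the shape of the change sketched in Python (the actual 
implementation is the Java `TimestampType`; this sketch is not Iceberg API):
   ```python
   from dataclasses import dataclass
   from enum import Enum

   class ChronoUnit(Enum):
       MICROS = "us"
       NANOS = "ns"

   @dataclass(frozen=True)
   class TimestampType:
       with_zone: bool
       unit: ChronoUnit  # one type class carries both the zone flag and the unit

       def __str__(self):
           base = "timestamptz" if self.with_zone else "timestamp"
           return base + ("_ns" if self.unit is ChronoUnit.NANOS else "")

   # The four spec types:
   for tz in (False, True):
       for unit in (ChronoUnit.MICROS, ChronoUnit.NANOS):
           print(TimestampType(tz, unit))  # timestamp, timestamp_ns, timestamptz, timestamptz_ns
   ```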





Re: [I] to_pandas() API which converts iceberg table scan to a pd.DataFrame will lost datetime data type and row order [iceberg-python]

2023-11-08 Thread via GitHub


zeddit commented on issue #132:
URL: https://github.com/apache/iceberg-python/issues/132#issuecomment-1802508681

   @Fokko  In conclusion, even though pyiceberg loads data in a deterministic 
way, meaning the result is preserved between runs, the result is far from 
ordered; we cannot get an ordered result without changing the logic of how 
pyiceberg assembles the data blocks.
   
   However, it does suggest a way to achieve this feature.
   
   The main problem is dealing with partitions: even if we conduct a global 
sort, the final order in the manifest list is still random. I will describe 
what I found during my experiments below.
   
   But first, let me argue for the importance of this feature, i.e. why we 
need deterministic, in-order results for sorted tables.
   It's mainly about time series analysis use cases, e.g. quant finance, which 
is the one shown on the pyiceberg homepage. In these use cases the data is 
ordered: every stock order and trade has a timestamp, and together they make 
up a time series, on which a lot of data science work is done to find trends 
and correlations in the time domain.
   
   
   
   





Re: [PR] API: implement types timestamp_ns and timestamptz_ns [iceberg]

2023-11-08 Thread via GitHub


jacobmarble commented on PR #9008:
URL: https://github.com/apache/iceberg/pull/9008#issuecomment-1802578562

   Do these need to be addressed in this PR?
   
   ```console
   TestSpark3Util > testDescribeSortOrder FAILED
   org.junit.ComparisonFailure: Sort order isn't correct. 
expected:<[hours(time) DESC NULLS FIRST]> but was:<[]>
   at org.junit.Assert.assertEquals(Assert.java:117)
   at 
org.apache.iceberg.spark.TestSpark3Util.testDescribeSortOrder(TestSpark3Util.java:84)
   
   TestFilteredScan > [format = parquet, vectorized = false, planningMode = 
LOCAL] > testHourPartitionedTimestampFilters[format = parquet, vectorized = 
false, planningMode = LOCAL] FAILED
   java.lang.AssertionError: Primitive value should be equal to expected 
expected:<8> but was:<5>
   at org.junit.Assert.fail(Assert.java:89)
   at org.junit.Assert.failNotEquals(Assert.java:835)
   at org.junit.Assert.assertEquals(Assert.java:120)
   at 
org.apache.iceberg.spark.data.GenericsHelpers.assertEqualsSafe(GenericsHelpers.java:119)
   at 
org.apache.iceberg.spark.data.GenericsHelpers.assertEqualsSafe(GenericsHelpers.java:68)
   at 
org.apache.iceberg.spark.source.TestFilteredScan.assertEqualsSafe(TestFilteredScan.java:573)
   at 
org.apache.iceberg.spark.source.TestFilteredScan.testHourPartitionedTimestampFilters(TestFilteredScan.java:374)
   ```





Re: [PR] Spec: clarify ns timestamps for ORC deserialization [iceberg]

2023-11-08 Thread via GitHub


rdblue commented on PR #9007:
URL: https://github.com/apache/iceberg/pull/9007#issuecomment-1802656478

   Thanks, @jacobmarble!





Re: [PR] Spec: clarify ns timestamps for ORC deserialization [iceberg]

2023-11-08 Thread via GitHub


rdblue merged PR #9007:
URL: https://github.com/apache/iceberg/pull/9007





Re: [PR] Core: Support replacing delete manifests [iceberg]

2023-11-08 Thread via GitHub


RussellSpitzer commented on code in PR #9000:
URL: https://github.com/apache/iceberg/pull/9000#discussion_r1387196142


##
core/src/test/java/org/apache/iceberg/TestRewriteManifests.java:
##
@@ -1105,6 +1108,499 @@ public void testRewriteManifestsOnBranchUnsupported() {
 "Cannot commit to branch someBranch: 
org.apache.iceberg.BaseRewriteManifests does not support branch commits");
   }
 
+  @Test
+  public void testRewriteDataManifestsPreservesDeletes() {
+Assumptions.assumeThat(formatVersion).isGreaterThan(1);
+
+Table table = load();
+
+table.newAppend().appendFile(FILE_A).appendFile(FILE_B).commit();
+
+Snapshot appendSnapshot = table.currentSnapshot();
+Assertions.assertThat(appendSnapshot.dataManifests(table.io())).hasSize(1);
+
Assertions.assertThat(appendSnapshot.deleteManifests(table.io())).isEmpty();
+
+
table.newRowDelta().addDeletes(FILE_A_DELETES).addDeletes(FILE_A2_DELETES).commit();
+
+Snapshot deleteSnapshot = table.currentSnapshot();
+Assertions.assertThat(deleteSnapshot.dataManifests(table.io())).hasSize(1);
+
Assertions.assertThat(deleteSnapshot.deleteManifests(table.io())).hasSize(1);
+
+table.rewriteManifests().clusterBy(file -> 
file.path().toString()).commit();

Review Comment:
   nit: Paths are equivalent here too right? Just wondering about the string






Re: [PR] Core: Support replacing delete manifests [iceberg]

2023-11-08 Thread via GitHub


aokolnychyi commented on code in PR #9000:
URL: https://github.com/apache/iceberg/pull/9000#discussion_r1387199381


##
core/src/test/java/org/apache/iceberg/TestRewriteManifests.java:
##
@@ -1105,6 +1108,499 @@ public void testRewriteManifestsOnBranchUnsupported() {
 "Cannot commit to branch someBranch: 
org.apache.iceberg.BaseRewriteManifests does not support branch commits");
   }
 
+  @Test
+  public void testRewriteDataManifestsPreservesDeletes() {
+Assumptions.assumeThat(formatVersion).isGreaterThan(1);
+
+Table table = load();
+
+table.newAppend().appendFile(FILE_A).appendFile(FILE_B).commit();
+
+Snapshot appendSnapshot = table.currentSnapshot();
+Assertions.assertThat(appendSnapshot.dataManifests(table.io())).hasSize(1);
+
Assertions.assertThat(appendSnapshot.deleteManifests(table.io())).isEmpty();
+
+
table.newRowDelta().addDeletes(FILE_A_DELETES).addDeletes(FILE_A2_DELETES).commit();
+
+Snapshot deleteSnapshot = table.currentSnapshot();
+Assertions.assertThat(deleteSnapshot.dataManifests(table.io())).hasSize(1);
+
Assertions.assertThat(deleteSnapshot.deleteManifests(table.io())).hasSize(1);
+
+table.rewriteManifests().clusterBy(file -> 
file.path().toString()).commit();

Review Comment:
   It is `ContentFile$path` that returns `CharSequence`. The idea of this 
closure is to split manifest entries based on the content file path, which 
ensures the original manifest with 2 entries is split into 2 new manifests. It 
is not the path of the manifest.






Re: [PR] Core: Support replacing delete manifests [iceberg]

2023-11-08 Thread via GitHub


RussellSpitzer commented on code in PR #9000:
URL: https://github.com/apache/iceberg/pull/9000#discussion_r1387199271


##
core/src/test/java/org/apache/iceberg/TestRewriteManifests.java:
##
@@ -1105,6 +1108,499 @@ public void testRewriteManifestsOnBranchUnsupported() {
 "Cannot commit to branch someBranch: 
org.apache.iceberg.BaseRewriteManifests does not support branch commits");
   }
 
+  @Test
+  public void testRewriteDataManifestsPreservesDeletes() {
+Assumptions.assumeThat(formatVersion).isGreaterThan(1);
+
+Table table = load();
+
+table.newAppend().appendFile(FILE_A).appendFile(FILE_B).commit();
+
+Snapshot appendSnapshot = table.currentSnapshot();
+Assertions.assertThat(appendSnapshot.dataManifests(table.io())).hasSize(1);
+
Assertions.assertThat(appendSnapshot.deleteManifests(table.io())).isEmpty();
+
+
table.newRowDelta().addDeletes(FILE_A_DELETES).addDeletes(FILE_A2_DELETES).commit();
+
+Snapshot deleteSnapshot = table.currentSnapshot();
+Assertions.assertThat(deleteSnapshot.dataManifests(table.io())).hasSize(1);
+
Assertions.assertThat(deleteSnapshot.deleteManifests(table.io())).hasSize(1);
+
+table.rewriteManifests().clusterBy(file -> 
file.path().toString()).commit();
+
+Snapshot rewriteSnapshot = table.currentSnapshot();
+
+validateSummary(rewriteSnapshot, 1, 1, 2, 2);
+
+List dataManifests = sortedDataManifests(table.io(), 
rewriteSnapshot);
+Assertions.assertThat(dataManifests).hasSize(2);
+validateManifest(
+dataManifests.get(0),
+dataSeqs(appendSnapshot.sequenceNumber(), 
appendSnapshot.sequenceNumber()),

Review Comment:
   Nit: Could put sequence number and snapshot ids in some local vars?








Re: [PR] Core: Support replacing delete manifests [iceberg]

2023-11-08 Thread via GitHub


aokolnychyi commented on code in PR #9000:
URL: https://github.com/apache/iceberg/pull/9000#discussion_r1387200273


##
core/src/test/java/org/apache/iceberg/TestRewriteManifests.java:
##
@@ -1105,6 +1108,499 @@ public void testRewriteManifestsOnBranchUnsupported() {
 "Cannot commit to branch someBranch: 
org.apache.iceberg.BaseRewriteManifests does not support branch commits");
   }
 
+  @Test
+  public void testRewriteDataManifestsPreservesDeletes() {
+Assumptions.assumeThat(formatVersion).isGreaterThan(1);
+
+Table table = load();
+
+table.newAppend().appendFile(FILE_A).appendFile(FILE_B).commit();
+
+Snapshot appendSnapshot = table.currentSnapshot();
+Assertions.assertThat(appendSnapshot.dataManifests(table.io())).hasSize(1);
+
Assertions.assertThat(appendSnapshot.deleteManifests(table.io())).isEmpty();
+
+
table.newRowDelta().addDeletes(FILE_A_DELETES).addDeletes(FILE_A2_DELETES).commit();
+
+Snapshot deleteSnapshot = table.currentSnapshot();
+Assertions.assertThat(deleteSnapshot.dataManifests(table.io())).hasSize(1);
+
Assertions.assertThat(deleteSnapshot.deleteManifests(table.io())).hasSize(1);
+
+table.rewriteManifests().clusterBy(file -> 
file.path().toString()).commit();
+
+Snapshot rewriteSnapshot = table.currentSnapshot();
+
+validateSummary(rewriteSnapshot, 1, 1, 2, 2);
+
+List dataManifests = sortedDataManifests(table.io(), 
rewriteSnapshot);
+Assertions.assertThat(dataManifests).hasSize(2);
+validateManifest(
+dataManifests.get(0),
+dataSeqs(appendSnapshot.sequenceNumber(), 
appendSnapshot.sequenceNumber()),

Review Comment:
   Will do.








[PR] Spark 3.4: Display more read metrics on Spark SQL UI [iceberg]

2023-11-08 Thread via GitHub


karuppayya opened a new pull request, #9009:
URL: https://github.com/apache/iceberg/pull/9009

   This is a cherry-pick of the following commit from 3.5:
   ```
   a44592501 Spark 3.5: Display more read metrics on Spark SQL UI (#8717)
   ```





Re: [PR] Core: Support replacing delete manifests [iceberg]

2023-11-08 Thread via GitHub


aokolnychyi commented on code in PR #9000:
URL: https://github.com/apache/iceberg/pull/9000#discussion_r1387249948


##
core/src/test/java/org/apache/iceberg/TestRewriteManifests.java:
##
@@ -1105,6 +1108,499 @@ public void testRewriteManifestsOnBranchUnsupported() {
 "Cannot commit to branch someBranch: 
org.apache.iceberg.BaseRewriteManifests does not support branch commits");
   }
 
+  @Test
+  public void testRewriteDataManifestsPreservesDeletes() {
+Assumptions.assumeThat(formatVersion).isGreaterThan(1);
+
+Table table = load();
+
+table.newAppend().appendFile(FILE_A).appendFile(FILE_B).commit();
+
+Snapshot appendSnapshot = table.currentSnapshot();
+Assertions.assertThat(appendSnapshot.dataManifests(table.io())).hasSize(1);
+
Assertions.assertThat(appendSnapshot.deleteManifests(table.io())).isEmpty();
+
+
table.newRowDelta().addDeletes(FILE_A_DELETES).addDeletes(FILE_A2_DELETES).commit();
+
+Snapshot deleteSnapshot = table.currentSnapshot();
+Assertions.assertThat(deleteSnapshot.dataManifests(table.io())).hasSize(1);
+
Assertions.assertThat(deleteSnapshot.deleteManifests(table.io())).hasSize(1);
+
+table.rewriteManifests().clusterBy(file -> 
file.path().toString()).commit();
+
+Snapshot rewriteSnapshot = table.currentSnapshot();
+
+validateSummary(rewriteSnapshot, 1, 1, 2, 2);
+
+List dataManifests = sortedDataManifests(table.io(), 
rewriteSnapshot);
+Assertions.assertThat(dataManifests).hasSize(2);
+validateManifest(
+dataManifests.get(0),
+dataSeqs(appendSnapshot.sequenceNumber(), 
appendSnapshot.sequenceNumber()),

Review Comment:
   Updated.






[PR] Build: Bump pyarrow from 14.0.0 to 14.0.1 [iceberg-python]

2023-11-08 Thread via GitHub


dependabot[bot] opened a new pull request, #136:
URL: https://github.com/apache/iceberg-python/pull/136

   Bumps [pyarrow](https://github.com/apache/arrow) from 14.0.0 to 14.0.1.
   
   Commits:
   - ba53748 MINOR: [Release] Update versions for 14.0.1 
(https://github.com/apache/arrow/commit/ba537483618196f50c67a90a473039e4d5dc35e0)
   - 529f376 MINOR: [Release] Update .deb/.rpm changelogs for 14.0.1 
(https://github.com/apache/arrow/commit/529f3768fa4fce80781cd1a3cbdcf3a826281b14)
   - b84bbca MINOR: [Release] Update CHANGELOG.md for 14.0.1 
(https://github.com/apache/arrow/commit/b84bbcac64e184a2b58661385506cff23006cc10)
   - f141709 GH-38607: [Python] Disable PyExtensionType autoload (#38608) 
(https://github.com/apache/arrow/commit/f14170976372436ec1d03a724d8d3f3925484ecf)
   - 5a37e74 GH-38431: [Python][CI] Update fs.type_name checks for s3fs tests 
(#38455) 
(https://github.com/apache/arrow/commit/5a37e741987e4cba41dfdad2331a308c95b62083)
   
   See full diff: 
https://github.com/apache/arrow/compare/go/v14.0.0...apache-arrow-14.0.1
   
   Dependabot compatibility score: 
https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=pyarrow&package-manager=pip&previous-version=14.0.0&new-version=14.0.1
   
   Dependabot will resolve any conflicts with this PR as long as you don't 
alter it yourself. You can also trigger a rebase manually by commenting 
`@dependabot rebase`.
   
   ---
   
   Dependabot commands and options
   
   You can trigger Dependabot actions by commenting on this PR:
   - `@dependabot rebase` will rebase this PR
   - `@dependabot recreate` will recreate this PR, overwriting any edits that 
have been made to it
   - `@dependabot merge` will merge this PR after your CI passes on it
   - `@dependabot squash and merge` will squash and merge this PR after your CI 
passes on it
   - `@dependabot cancel merge` will cancel a previously requested merge and 
block automerging
   - `@dependabot reopen` will reopen this PR if it is closed
   - `@dependabot close` will close this PR and stop Dependabot recreating it. 
You can achieve the same result by closing it manually
   - `@dependabot show <dependency name> ignore conditions` will show all of 
the ignore conditions of the specified dependency
   - `@dependabot ignore this major version` will close this PR and stop 
Dependabot creating any more for this major version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this minor version` will close this PR and stop 
Dependabot creating any more for this minor version (unless you reopen the PR 
or upgrade to it yourself)
   - `@dependabot ignore this dependency` will close this PR and stop 
Dependabot creating any more for this dependency (unless you reopen the PR or 
upgrade to it yourself)





Re: [PR] Core: Support replacing delete manifests [iceberg]

2023-11-08 Thread via GitHub


aokolnychyi closed pull request #9000: Core: Support replacing delete manifests
URL: https://github.com/apache/iceberg/pull/9000





Re: [I] planFiles with ParallelIterator OOM(Out of memory) [iceberg]

2023-11-08 Thread via GitHub


github-actions[bot] commented on issue #7594:
URL: https://github.com/apache/iceberg/issues/7594#issuecomment-1802943749

   This issue has been automatically marked as stale because it has been open 
for 180 days with no activity. It will be closed in next 14 days if no further 
activity occurs. To permanently prevent this issue from being considered stale, 
add the label 'not-stale', but commenting on the issue is preferred when 
possible.





Re: [I] How to ensure the data is not repeated when using spark to write to the iceberg table [iceberg]

2023-11-08 Thread via GitHub


github-actions[bot] commented on issue #7554:
URL: https://github.com/apache/iceberg/issues/7554#issuecomment-1802943769

   This issue has been automatically marked as stale because it has been open 
for 180 days with no activity. It will be closed in next 14 days if no further 
activity occurs. To permanently prevent this issue from being considered stale, 
add the label 'not-stale', but commenting on the issue is preferred when 
possible.





Re: [PR] Spark 3.4: Display more read metrics on Spark SQL UI [iceberg]

2023-11-08 Thread via GitHub


aokolnychyi merged PR #9009:
URL: https://github.com/apache/iceberg/pull/9009





Re: [I] DELETE fails with "java.lang.IllegalArgumentException: info must be ExtendedLogicalWriteInfo" [iceberg]

2023-11-08 Thread via GitHub


bknbkn commented on issue #8926:
URL: https://github.com/apache/iceberg/issues/8926#issuecomment-1803040845

   Check that you have the correct dependency on the spark-extensions module.
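   A hedged sketch of what this usually means in practice: `AnsiCast` was 
removed in newer Spark versions, so the `NoClassDefFoundError` above typically 
indicates an iceberg-spark-runtime artifact built for a different Spark minor 
version. Assuming Spark 3.4 and Iceberg 1.4.2 (placeholders), pin a matching 
runtime:
   ```python
   from pyspark.sql import SparkSession

   # The runtime artifact must match the Spark minor version (3.4 here)
   # and the Scala build; the versions below are placeholders.
   spark = (
       SparkSession.builder
       .config("spark.jars.packages",
               "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.2")
       .config("spark.sql.extensions",
               "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
       .getOrCreate()
   )
   ```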





Re: [PR] Parquet: Support reading INT96 column in row group filter [iceberg]

2023-11-08 Thread via GitHub


manuzhang commented on PR #8988:
URL: https://github.com/apache/iceberg/pull/8988#issuecomment-1803050048

   Any more comments?





Re: [I] to_pandas() API which converts iceberg table scan to a pd.DataFrame will lost datetime data type and row order [iceberg-python]

2023-11-08 Thread via GitHub


zeddit commented on issue #132:
URL: https://github.com/apache/iceberg-python/issues/132#issuecomment-1803117179

   Here are my experiments and main findings.
   ### 1. Checking for consistent ordering in pyiceberg
   First, I create an empty table with no partition or sorted_by properties 
and check whether pyiceberg returns deterministic results, because if 
pyiceberg loads data in a random order, there is no way to achieve the goal.
   
   I create the table with the Hive catalog in Trino:
   ```
   CREATE TABLE test_table1(
   date date
   )
   WITH (
   format = 'PARQUET',
   location = 's3a://test/test_table1'
   );
   ```
   and insert some values into it with
   ```
   INSERT INTO test_table1 VALUES (date '2021-01-04');
   ```
   I inserted a sequence of dates one by one, e.g. '2021-01-01', '2021-01-04', 
'2021-01-03', '2021-01-07', '2021-01-02'; the sequence is not sorted.
   Every insertion creates a snapshot, and because we insert the rows one by 
one, each row is put in its own data file.
   
   After making that table, we select from the table and observe the result.
   In Trino, `select * from test_table1` does not give a consistent result; 
every row could be in any position.
   In pyiceberg, the order of rows is fixed, and it is the reverse of the 
order in which we inserted the rows.
   I ran load_table 100 times and compared the results of every run; they all 
return the same order.
   
   When I insert some new rows into the table, they show up in the first few 
lines of the result, e.g. when adding '2021-01-05', '2021-01-09', 
'2021-01-06', '2021-01-08' one by one:
   (screenshot: https://github.com/apache/iceberg-python/assets/30164206/0f30573c-a2f8-4d66-b312-e225ef49f4c0)
   
   I also checked the source code in pyiceberg/io/pyarrow.py. There is a 
consistent ordering for reading data files, and within each data file, the 
order is the write order.
   
   Then I used `alter table test_table1 execute optimize;` to merge them into 
one data file.
   After that, both the Trino client and pyiceberg return a consistent result, 
because there is only one data file, so the order within it is preserved.
   However, if the table has no `sorted_by` property, there is no order 
guarantee when compacting the data files.
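   For reproducibility, a minimal pyiceberg sketch of the read side (catalog 
and table names are placeholders): the scan order is deterministic but not 
sorted, so an explicit sort recovers the intended order.
   ```python
   from pyiceberg.catalog import load_catalog

   catalog = load_catalog("default")               # assumes a configured catalog
   table = catalog.load_table("test.test_table1")  # placeholder identifier
   df = table.scan().to_pandas()                   # deterministic, but unsorted
   df_sorted = df.sort_values("date").reset_index(drop=True)
   ```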
   





Re: [I] to_pandas() API which converts iceberg table scan to a pd.DataFrame will lost datetime data type and row order [iceberg-python]

2023-11-08 Thread via GitHub


zeddit commented on issue #132:
URL: https://github.com/apache/iceberg-python/issues/132#issuecomment-1803131783

   ### 2. Adding the sorted_by property (called sort_order in Iceberg)
   Because we want a sorted Iceberg table to store and return time series data, 
we create a sorted_by table in Trino with:
   ```
   CREATE TABLE test_table2(
   date date
   )
   WITH (
   format = 'PARQUET',
   location = 's3a://test/test_table2',
   sorted_by = ARRAY['date']
   );
   ```
   and we check the table properties in the pyiceberg CLI to make sure the 
sort_order property is set:
   ```
   $ pyiceberg describe test_table2
   Table format version  2
   Metadata location     s3a://test/test_table2/metadata/00025-cd229550-87da-4b49-8013-28d862e8…
   Table UUID            e9c38f74-f556-43b2-a9f9-6facdd884723
   Last Updated          1699465355600
   Partition spec        []
   Sort order            [
                           1 ASC NULLS FIRST
                         ]
   Current schema        Schema, id=0
                         └── 1: date: optional date
   ```
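   
   The same check can be done programmatically; a small sketch, assuming the 
same illustrative `hive` catalog as before (Table.sort_order() reflects my 
reading of the pyiceberg API, so treat it as an assumption):
   ```
   from pyiceberg.catalog import load_catalog

   catalog = load_catalog("hive")
   tbl = catalog.load_table("default.test_table2")

   # Print the declared sort order; expect field 1 ASC NULLS FIRST.
   print(tbl.sort_order())
   ```
   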
   Then we insert the rows into this new table, again one by one and in the 
same kind of sequence: '2021-01-01', '2021-01-04', '2021-01-03', 
'2021-01-07', '2021-01-02', '2021-01-05', '2021-01-09', '2021-01-06', 
'2021-01-08'.
   
   Then we read data from the new table.
   In Trino, `select * from test_table2` still returns no consistent order; 
the result varies between runs.
   In pyiceberg, it is good to see a fixed order; however, it is not the 
ascending order that we declared and want for the table. This is because we 
insert rows one by one, so the rows end up in different underlying data files.
   
   When we compact the table into one data file with Trino's `alter table 
test_table2 execute optimize;`, the result in both Trino and pyiceberg is 
fixed and ordered. That is because there is only one data file, and the 
result order is the order in which the data was written into that file.
   When the `sorted_by` property is specified, a local sort is performed, 
which ensures that each data file is internally ordered.
   
   Then we append the same amount of data into the sorted table with:
   ```
   insert into test_table2 select * from test_table1;
   ```
   We got the result below:
   (screenshot: https://github.com/apache/iceberg-python/assets/30164206/b05a3e25-eacd-45a1-8d63-7442fb3fe12d)
   This is not what we expected; the expected result is:
   (screenshot: https://github.com/apache/iceberg-python/assets/30164206/976d1022-4b1e-488c-8866-4f465db40176)
   This is because the second insert stores its rows in another data file, 
which is simply concatenated with the earlier one.
   
   When we merge them with `alter table test_table2 execute optimize;`, the 
result is as expected.
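   
   Until the reader applies a global order across data files, a client-side 
sort is a pragmatic workaround; a sketch, assuming the scan result fits in 
memory:
   ```
   # Read the rows (ordered per data file, then concatenated) and
   # restore the intended global order on the client.
   df = tbl.scan().to_pandas()
   df = df.sort_values("date").reset_index(drop=True)
   ```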
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] to_pandas() API which converts iceberg table scan to a pd.DataFrame will lost datetime data type and row order [iceberg-python]

2023-11-08 Thread via GitHub


zeddit commented on issue #132:
URL: https://github.com/apache/iceberg-python/issues/132#issuecomment-1803136799

   ### 3. How about partitioned tables?
   It is a common case for people to use partitions to manage tables, and in 
Iceberg partitioning can lead to a great performance gain by skipping many 
data files (a pruning sketch follows at the end of this comment).
   We test the order behavior with the following two tables:
   ```
   CREATE TABLE test_table3(
   date date
   )
   WITH (
   format = 'PARQUET',
   format_version = 2,
   location = 's3a://test/test_table3',
   partitioning = ARRAY['month(date)'],
   sorted_by = ARRAY['date']
   )
   ---
   CREATE TABLE test_table4 (
   date date,
   sym varchar
   )
   WITH (
   format = 'PARQUET',
   format_version = 2,
   location = 's3a://test/test_table4',
   partitioning = ARRAY['sym'],
   sorted_by = ARRAY['sym','date']
   )
   ```
   We also insert some values and read the results to observe their order.
   The bad news is that the order between partitions cannot be controlled by 
any writing strategy: even when we conduct a global sort, the month order in 
the final result is still random, which makes time series analysis 
disappointing.
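   
   As an aside, the pruning gain mentioned above is also visible from 
pyiceberg; a sketch against test_table4, where 'AAPL' is a made-up sym value 
and `tbl` is the loaded table:
   ```
   from pyiceberg.expressions import EqualTo

   # Filtering on the partition column lets the planner skip data
   # files that belong to other sym partitions entirely.
   df = tbl.scan(row_filter=EqualTo("sym", "AAPL")).to_pandas()
   df = df.sort_values("date").reset_index(drop=True)
   ```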


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] to_pandas() API which converts iceberg table scan to a pd.DataFrame will lost datetime data type and row order [iceberg-python]

2023-11-08 Thread via GitHub


zeddit commented on issue #132:
URL: https://github.com/apache/iceberg-python/issues/132#issuecomment-1803138817

   ### 4. Limitations of the experiments
   I have not tested the following cases:
   
   - 1. when the data is large enough to be split within a partition, i.e. 
multiple data files that each contain part of the whole data.
   - 2. what happens when a row deletion occurs, and whether the order gets 
messed up.
   - 3. what happens when schema evolution occurs.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] to_pandas() API which converts iceberg table scan to a pd.DataFrame will lost datetime data type and row order [iceberg-python]

2023-11-08 Thread via GitHub


zeddit commented on issue #132:
URL: https://github.com/apache/iceberg-python/issues/132#issuecomment-1803140124

   As a first-step idea, we could add a parsing module that inspects the 
manifests and determines an order in which to concatenate pa.Tables to 
assemble the final output. I don't know whether this is feasible.
   
   I will dig into the structure of the manifests and come back with more feedback.
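   
   As a starting point, the planned file order can already be observed without 
parsing manifests by hand; a sketch, assuming a loaded table `tbl` 
(plan_files() and the task/file attribute names reflect my reading of 
pyiceberg and may differ by version):
   ```
   # Each task corresponds to one data file, in the order the scan
   # will read it; the final concatenation follows this sequence.
   for task in tbl.scan().plan_files():
       print(task.file.file_path)
   ```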


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



[PR] Refactor HiveTableOperations with common code for View. [iceberg]

2023-11-08 Thread via GitHub


nk1506 opened a new pull request, #9011:
URL: https://github.com/apache/iceberg/pull/9011

   As part of issue #8698, all the common pieces between table and view have 
been moved from `HiveTableOperations` to a new helper class.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] manifest exception [iceberg]

2023-11-08 Thread via GitHub


innocent123 commented on issue #8994:
URL: https://github.com/apache/iceberg/issues/8994#issuecomment-1803186989

   > @innocent123: I do not really understand your question, but I think your 
problem might be similar to #5846.
   
   When I use the Spark API rewriteDataFiles, this error is reported: 
"org.apache.iceberg.exceptions.RuntimeIOException:  Failed to get block 
locations for path: 
hdfs:///iceberg-hive/hadoop_prod/iceberg_x5l_pre/xxx/data/sample_date=20231107/9-0-54295f94-ef26-44be-b6b8-ca3e472d9482-00010.parquet"
   
   How do I fix this error?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] manifest exception [iceberg]

2023-11-08 Thread via GitHub


innocent123 commented on issue #8994:
URL: https://github.com/apache/iceberg/issues/8994#issuecomment-1803189818

   > > @innocent123: I do not really understand your question, but I think your 
problem might be similar to #5846.
   > 
   > When I use the Spark API rewriteDataFiles, this error is reported: 
"org.apache.iceberg.exceptions.RuntimeIOException: Failed to get block 
locations for path: 
hdfs:///iceberg-hive/hadoop_prod/iceberg_x5l_pre/xxx/data/sample_date=20231107/9-0-54295f94-ef26-44be-b6b8-ca3e472d9482-00010.parquet"
   > 
   > How do I fix this error?
   
   The data file is lost, but the manifest entry (manifest_path) still exists.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] manifest exception [iceberg]

2023-11-08 Thread via GitHub


innocent123 commented on issue #8994:
URL: https://github.com/apache/iceberg/issues/8994#issuecomment-1803192032

   Spark version: 3.0.2


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] manifest exception [iceberg]

2023-11-08 Thread via GitHub


innocent123 commented on issue #8994:
URL: https://github.com/apache/iceberg/issues/8994#issuecomment-1803192419

   Iceberg version: 1.0.0


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Hive: Add View support for HIVE catalog [iceberg]

2023-11-08 Thread via GitHub


nk1506 commented on code in PR #8907:
URL: https://github.com/apache/iceberg/pull/8907#discussion_r1387522573


##
hive-metastore/src/main/java/org/apache/iceberg/hive/HiveViewOperations.java:
##
@@ -0,0 +1,389 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package org.apache.iceberg.hive;
+
+import java.util.Collections;
+import java.util.Locale;
+import java.util.Map;
+import java.util.Objects;
+import java.util.Optional;
+import java.util.concurrent.atomic.AtomicReference;
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.hive.metastore.IMetaStoreClient;
+import org.apache.hadoop.hive.metastore.TableType;
+import org.apache.hadoop.hive.metastore.api.InvalidObjectException;
+import org.apache.hadoop.hive.metastore.api.NoSuchObjectException;
+import org.apache.hadoop.hive.metastore.api.SerDeInfo;
+import org.apache.hadoop.hive.metastore.api.StorageDescriptor;
+import org.apache.hadoop.hive.metastore.api.Table;
+import org.apache.iceberg.BaseMetastoreTableOperations;
+import org.apache.iceberg.ClientPool;
+import org.apache.iceberg.SchemaParser;
+import org.apache.iceberg.TableProperties;
+import org.apache.iceberg.catalog.TableIdentifier;
+import org.apache.iceberg.exceptions.AlreadyExistsException;
+import org.apache.iceberg.exceptions.CommitFailedException;
+import org.apache.iceberg.exceptions.CommitStateUnknownException;
+import org.apache.iceberg.exceptions.NoSuchViewException;
+import org.apache.iceberg.exceptions.ValidationException;
+import org.apache.iceberg.hive.HiveCatalogUtil.CommitStatus;
+import org.apache.iceberg.io.FileIO;
+import org.apache.iceberg.relocated.com.google.common.base.Preconditions;
+import org.apache.iceberg.relocated.com.google.common.collect.ImmutableMap;
+import org.apache.iceberg.relocated.com.google.common.collect.Maps;
+import org.apache.iceberg.util.Tasks;
+import org.apache.iceberg.view.BaseViewOperations;
+import org.apache.iceberg.view.ViewMetadata;
+import org.apache.thrift.TException;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/** Hive implementation of Iceberg ViewOperations. */
+final class HiveViewOperations extends BaseViewOperations {

Review Comment:
   @pvary, I have refactored the common code 
[here](https://github.com/apache/iceberg/pull/9011)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Parquet: Support reading INT96 column in row group filter [iceberg]

2023-11-08 Thread via GitHub


nastra commented on PR #8988:
URL: https://github.com/apache/iceberg/pull/8988#issuecomment-1803297752

   I think it would be good to also get the opinion of @RussellSpitzer or 
@aokolnychyi on this one


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [I] Default table properties not respected when using Spark DataFrame API [iceberg]

2023-11-08 Thread via GitHub


nastra commented on issue #8265:
URL: https://github.com/apache/iceberg/issues/8265#issuecomment-1803299179

   @boushphong yes, feel free to work on this in case you're interested.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org



Re: [PR] Nessie: Support views for NessieCatalog [iceberg]

2023-11-08 Thread via GitHub


nastra commented on code in PR #8909:
URL: https://github.com/apache/iceberg/pull/8909#discussion_r1387607715


##
core/src/test/java/org/apache/iceberg/view/ViewCatalogTests.java:
##
@@ -400,8 +400,15 @@ public void 
replaceTableViaTransactionThatAlreadyExistsAsView() {
 .buildTable(viewIdentifier, SCHEMA)
 .replaceTransaction()
 .commitTransaction())
-.isInstanceOf(NoSuchTableException.class)
-.hasMessageStartingWith("Table does not exist: ns.view");
+.satisfiesAnyOf(
+throwable ->
+assertThat(throwable)
+.isInstanceOf(NoSuchTableException.class)
+.hasMessageStartingWith("Table does not exist: ns.view"),
+throwable ->
+assertThat(throwable)

Review Comment:
   I actually think that the way Nessie currently handles this error case is 
correct: it reports that a view with the same name already exists when you try 
to create a table. Achieving the same for REST is slightly more difficult. 
There's also a TODO in the test a few lines above where I wanted to improve 
the error reporting for the 2 particular cases that were adjusted in this 
test. That being said, I'm +1 on these 2 changes to the test.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org
For additional commands, e-mail: issues-h...@iceberg.apache.org