Fokko commented on issue #132: URL: https://github.com/apache/iceberg-python/issues/132#issuecomment-1803401481
> In my opinion, if we need the user to sort a large dataset in Python, it will make a lot of use cases unusable, especially for time-series cases. Python is slow, unlike the MPP or Spark situation.

I agree there, but the API is often just in Python. For example, we use PyArrow, which pushes the work down into the C++ layer.

> I want to know if PyIceberg reads data in a consistent way.

Yes, we do some ordering. This is important when we fetch the top n rows: we always want to return the same rows, not the ones that came back the quickest.

> If a local sort is used for partitions, the row order within a partition will be preserved. In other words, the order is the one in which the data was written.

Yes, this is true, and I think you can achieve this today. Especially when the data is small (which it is in your case, IIRC from Slack), you could just write a single file for the partition.

> Every insertion will create a snapshot, and because we insert the rows individually, each row will be put in its own data file.

Every insertion being a new snapshot is okay, but I think it will also create a new manifest per commit. Trino uses [fast-append](https://iceberg.apache.org/spec/#snapshots), which means that for each append operation a new manifest is created instead of rewriting the existing metadata. We [keep the order of sequences](https://github.com/apache/iceberg-python/blob/main/pyiceberg/table/__init__.py#L871-L875) since the Python list will maintain order.

> Then we try to append the same amount of data into the sorted table with

If you add an `ORDER BY 1`, you'll introduce a global sort, which should probably fix this without needing to optimize the table.

> It's bad news that the order between partitions can never be controlled by any means of controlling the writing method. E.g., even when we conduct a global sort, the month order in the final result is still random, which makes time-series analysis disappointing.
I think you can also fix this by adding an `ORDER BY`.

> When the data is large enough it gets split within a partition, i.e. multiple data files that each contain part of the whole data.

If your data is relatively small, then having a single file is best. Also, you could tune the row-group sizes to get decent parallelism (PyArrow will do this for you).

> What will happen when a row deletion occurs? Will the order be messed up?

The order should be maintained.

> What will happen when schema evolution occurs?

It should not influence the order, since it is just schema projection.

> Besides, it's quite strange that newly inserted rows show up at the top of the final results; in my opinion this will add challenges to integrating with the data-ingestion subsystem. Any advice on it?

I think this depends mostly on Trino and on how the data is written. As mentioned earlier, the fast-append might mess things up because you're relying on how Trino produces the manifest.