wgtmac commented on code in PR #30: URL: https://github.com/apache/iceberg-cpp/pull/30#discussion_r2010260510
########## src/iceberg/file_io.h: ########## Review Comment: Why do we need this? ########## src/iceberg/file_io.h: ########## @@ -0,0 +1,112 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#pragma once + +#include <memory> +#include <string> + +#include "iceberg/iceberg_export.h" + +namespace iceberg { + +// Forward declarations +class Reader; +class Writer; + +/// \brief An interface used to read input files using Reader and AsyncReader +class ICEBERG_EXPORT InputFile { + public: + explicit InputFile(std::string location) : location_(std::move(location)) {} + + virtual ~InputFile() = default; + + /// \brief Checks whether the file exists. + virtual bool exists() const = 0; + /// \brief Returns the total length of the file, in bytes. + virtual int64_t getLength() const = 0; + + /// \brief Get a Reader instance to read bytes from the file. + virtual std::unique_ptr<Reader> newReader() = 0; + + /// \brief Get the file location + const std::string& location() const { return location_; } + + protected: + std::string location_; +}; + +/// \brief An interface used to write output files using Writer and AsyncWriter +class ICEBERG_EXPORT OutputFile { + public: + explicit OutputFile(std::string location) : location_(std::move(location)) {} + + virtual ~OutputFile() = default; + + /// \brief Create the file. + /// + /// If the file exists, or an error will be thrown. + virtual void create() = 0; + + /// \brief Get a Writer instance to write bytes to the file. + virtual std::unique_ptr<Writer> newWriter() = 0; + + /// \brief Get the file location + const std::string& location() const { return location_; } + + /// \brief Return an InputFile for the location of this OutputFile. + virtual std::shared_ptr<InputFile> toInputFile() const = 0; + + protected: + std::string location_; +}; + +/// \brief Pluggable module for reading, writing, and deleting files. +/// +/// Both table metadata files and data files can be written and read by this module. +class ICEBERG_EXPORT FileIO { + public: + explicit FileIO(std::string name) : name_(std::move(name)) {} + + virtual ~FileIO() = default; + + /// \brief Get an InputFile + /// + /// Get a InputFile instance to read bytes from the file at the given location. + virtual std::shared_ptr<InputFile> newInputFile(const std::string& location) = 0; Review Comment: I'm a little bit skeptical of the usability of `InputFile` and `OutputFile` interfaces here. All Parquet/Avro/Orc libraries have defined their own equivalent APIs and downstream projects should already have a translation layer between their inhouse I/O interface (e.g. arrow::FileSystem) and the file format libraries. Adding `InputFile` and `OutputFile` interfaces just impose them to implement yet another translation layer in between. ########## CMakeLists.txt: ########## @@ -38,6 +38,7 @@ option(ICEBERG_BUILD_SHARED "Build shared library" OFF) option(ICEBERG_BUILD_TESTS "Build tests" ON) option(ICEBERG_ARROW "Build Arrow" ON) option(ICEBERG_AVRO "Build Avro" ON) +option(ICEBERG_IO "Build IO" ON) Review Comment: Can we remove this? I think we don't need a separate library for I/O. We can just define `FileIO` and related interfaces in the `libiceberg`. `iceberg-arrow` could possibly implement these APIs by adapting to `arrow::FileSystem`. We can provide an implementation for `local` or `memory` filesystem in the unit test modules. ########## src/iceberg/io/fs_file_reader.cc: ########## @@ -0,0 +1,80 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, + * software distributed under the License is distributed on an + * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + * KIND, either express or implied. See the License for the + * specific language governing permissions and limitations + * under the License. + */ + +#include "iceberg/io/fs_file_reader.h" + +#include <format> + +#include "iceberg/exception.h" + +namespace iceberg::io { + +FsFileReader::FsFileReader(std::string file_path) : file_path_(std::move(file_path)) { + // Open the file in binary mode + input_file_.open(file_path_, std::ios::binary | std::ios::in); + if (!input_file_.is_open()) { + throw IcebergError(std::format("Failed to open file: {}", file_path_)); + } + + // Calculate the file size + input_file_.seekg(0, std::ios::end); + file_size_ = input_file_.tellg(); + input_file_.seekg(0, std::ios::beg); + + if (file_size_ < 0) { + throw IcebergError(std::format("Failed to determine file size: {}", file_path_)); + } +} + +FsFileReader::~FsFileReader() { + if (input_file_.is_open()) { + input_file_.close(); + } +} + +int64_t FsFileReader::read(ReadRange range, void* buffer) { + if (!input_file_.is_open()) { + throw IcebergError("File is not open for reading"); + } + + if (range.offset < 0 || range.offset + range.length > file_size_) { + throw IcebergError(std::format("Invalid read range: [{}, {})", range.offset, + range.offset + range.length)); + } + + // Seek to the starting position + input_file_.seekg(range.offset, std::ios::beg); + if (input_file_.fail()) { + throw IcebergError(std::format("Failed to seek to offset: {}", range.offset)); + } + + // Read the data into the buffer + input_file_.read(static_cast<char*>(buffer), range.length); + auto bytes_read = static_cast<int64_t>(input_file_.gcount()); + + return bytes_read; // Return actual bytes read +} + +std::future<int64_t> FsFileReader::readAsync(ReadRange range, void* buffer) { Review Comment: I think we'd better remove any async API for now. We can add them later if they are really useful. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@iceberg.apache.org For additional commands, e-mail: issues-h...@iceberg.apache.org