This is an automated email from the ASF dual-hosted git repository.

tallison pushed a commit to branch update-pipes-docs
in repository https://gitbox.apache.org/repos/asf/tika.git
commit d720e7dde92025a13e36e52752203c19e955c79d
Author: tallison <[email protected]>
AuthorDate: Thu Apr 9 21:05:10 2026 -0400

    update pipes docs
---
 docs/modules/ROOT/nav.adoc                         |  6 +++
 docs/modules/ROOT/pages/pipes/index.adoc           | 40 +++++++++++------
 .../ROOT/pages/using-tika/java-api/index.adoc      | 51 +++++++++++++++++++---
 3 files changed, 78 insertions(+), 19 deletions(-)

diff --git a/docs/modules/ROOT/nav.adoc b/docs/modules/ROOT/nav.adoc
index 1702591425..a9f4a6e951 100644
--- a/docs/modules/ROOT/nav.adoc
+++ b/docs/modules/ROOT/nav.adoc
@@ -21,6 +21,12 @@
 ** xref:using-tika/cli/index.adoc[Command Line]
 ** xref:using-tika/grpc/index.adoc[gRPC]
 * xref:pipes/index.adoc[Pipes]
+** xref:pipes/getting-started.adoc[Getting Started]
+** xref:pipes/fetchers.adoc[Fetchers]
+** xref:pipes/emitters.adoc[Emitters]
+** xref:pipes/iterators.adoc[Iterators]
+** xref:pipes/reporters.adoc[Reporters]
+** xref:pipes/configuration.adoc[Pipeline Configuration]
 ** xref:pipes/parse-modes.adoc[Parse Modes]
 ** xref:pipes/unpack-config.adoc[Extracting Embedded Bytes]
 ** xref:pipes/timeouts.adoc[Timeouts]
diff --git a/docs/modules/ROOT/pages/pipes/index.adoc b/docs/modules/ROOT/pages/pipes/index.adoc
index a6fea020ae..796f9d7f1f 100644
--- a/docs/modules/ROOT/pages/pipes/index.adoc
+++ b/docs/modules/ROOT/pages/pipes/index.adoc
@@ -21,24 +21,36 @@
 This section covers Tika Pipes for scalable, fault-tolerant document processing.
 
 == Overview
 
-Tika Pipes provides a framework for processing large volumes of documents with:
+Tika Pipes provides a framework for fault-tolerant, scalable document processing.
+Each document is parsed in a forked JVM with configurable timeouts and memory limits,
+so a single malformed file cannot crash or hang your application.
 
-* **Fetchers** - Retrieve documents from various sources (filesystem, S3, HTTP, etc.)
-* **Emitters** - Send parsed results to various destinations (filesystem, OpenSearch, ES-compatible, Solr, etc.)
-* **Pipelines** - Configure processing workflows
+While Tika Pipes has a programmatic Java API, it is best used through:
 
-== Topics
+* xref:using-tika/cli/index.adoc[tika-app] -- batch processing from the command line
+* xref:using-tika/server/index.adoc[tika-server] -- REST API with pipes-based robustness built in
+* xref:using-tika/grpc/index.adoc[tika-grpc] -- gRPC API with pipes-based robustness built in
+
+See xref:advanced/robustness.adoc[Robustness] for details on how Tika Pipes protects
+against problematic files.
+
+=== Key Components
 
-* xref:pipes/parse-modes.adoc[Parse Modes] - Control how documents are parsed and emitted (`RMETA`, `CONCATENATE`, `CONTENT_ONLY`, `UNPACK`)
-* xref:pipes/unpack-config.adoc[Extracting Embedded Bytes] - Extract raw bytes from embedded documents using `ParseMode.UNPACK`
-* xref:pipes/timeouts.adoc[Timeouts] - Two-tier timeout system for handling long-running and hung parsers
+* **Fetchers** -- retrieve documents from various sources (filesystem, S3, HTTP, etc.)
+* **Emitters** -- send parsed results to various destinations (filesystem, OpenSearch, ES-compatible, Solr, etc.)
+* **Pipelines** -- configure processing workflows
+
+== Topics
 
-// Add links to specific topics as they are created
-// * link:getting-started.html[Getting Started]
-// * link:fetchers.html[Fetchers]
-// * link:emitters.html[Emitters]
-// * link:configuration.html[Configuration]
-// * link:async.html[Async Processing]
+* xref:pipes/getting-started.adoc[Getting Started] -- complete working example with tika-app
+* xref:pipes/fetchers.adoc[Fetchers] -- all available document sources (filesystem, S3, HTTP, GCS, Azure, etc.)
+* xref:pipes/emitters.adoc[Emitters] -- all available output destinations (filesystem, ES, OpenSearch, Solr, S3, Kafka, etc.)
+* xref:pipes/iterators.adoc[Iterators] -- document enumeration (directory walk, S3 listing, CSV, JDBC, Kafka, etc.)
+* xref:pipes/reporters.adoc[Reporters] -- track per-document processing status
+* xref:pipes/configuration.adoc[Pipeline Configuration] -- numClients, timeouts, JVM args, parse modes, emit batching
+* xref:pipes/parse-modes.adoc[Parse Modes] -- control how documents are parsed and emitted (`RMETA`, `CONCATENATE`, `CONTENT_ONLY`, `UNPACK`)
+* xref:pipes/unpack-config.adoc[Extracting Embedded Bytes] -- extract raw bytes from embedded documents
+* xref:pipes/timeouts.adoc[Timeouts] -- two-tier timeout system for handling long-running and hung parsers
 
 == Emitters
diff --git a/docs/modules/ROOT/pages/using-tika/java-api/index.adoc b/docs/modules/ROOT/pages/using-tika/java-api/index.adoc
index 4853446d50..22844404a0 100644
--- a/docs/modules/ROOT/pages/using-tika/java-api/index.adoc
+++ b/docs/modules/ROOT/pages/using-tika/java-api/index.adoc
@@ -24,9 +24,49 @@ This section covers using Apache Tika programmatically in your Java applications
 Tika can be embedded directly into your Java applications as a library. This gives
 you full control over parsing, detection, and configuration.
 
-However, for most use cases we recommend using xref:using-tika/server/index.adoc[tika-server]
-or xref:using-tika/grpc/index.adoc[tika-grpc] instead. See
-xref:using-tika/java-api/getting-started.adoc[Getting Started] for guidance on choosing the right approach.
+IMPORTANT: Some file formats can trigger excessive memory use, infinite loops, or JVM
+crashes in the underlying parsing libraries. For production systems processing untrusted
+files, use xref:pipes/index.adoc[Tika Pipes], which runs each parse in a forked JVM with
+timeouts and memory limits. Alternatively, xref:using-tika/server/index.adoc[tika-server]
+and xref:using-tika/grpc/index.adoc[tika-grpc] provide the same robustness as a service.
+See xref:advanced/robustness.adoc[Robustness] for details.
+
+== Dependencies
+
+Add the following to your `pom.xml`:
+
+[source,xml,subs=attributes+]
+----
+<dependency>
+  <groupId>org.apache.tika</groupId>
+  <artifactId>tika-parsers-standard-package</artifactId>
+  <version>{tika-version}</version>
+</dependency>
+----
+
+This pulls in `tika-core` and all standard parsers (PDF, Office, HTML, etc.).
+
+If you only need detection (no parsing) or want to select parsers individually:
+
+[source,xml,subs=attributes+]
+----
+<dependency>
+  <groupId>org.apache.tika</groupId>
+  <artifactId>tika-core</artifactId>
+  <version>{tika-version}</version>
+</dependency>
+----
+
+To use `TikaLoader` for JSON-based configuration, also add:
+
+[source,xml,subs=attributes+]
+----
+<dependency>
+  <groupId>org.apache.tika</groupId>
+  <artifactId>tika-serialization</artifactId>
+  <version>{tika-version}</version>
+</dependency>
+----
 
 == Parsers
@@ -165,14 +205,15 @@ container detection becomes available.
 [source,java]
 ----
-TikaConfig tika = new TikaConfig();
+TikaLoader loader = TikaLoader.loadDefault();
+Detector detector = loader.loadDetectors();
 ParseContext parseContext = new ParseContext();
 for (Path p : myListOfPaths) {
     Metadata metadata = new Metadata();
     try (TikaInputStream stream = TikaInputStream.get(p, metadata)) {
-        MediaType mimetype = tika.getDetector().detect(stream, metadata, parseContext);
+        MediaType mimetype = detector.detect(stream, metadata, parseContext);
         System.out.println("File " + p + " is " + mimetype);
     }
 }
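Reviewer note: the new Pipeline Configuration page references knobs such as numClients, timeouts, and forked-JVM args. For context, a Tika 2.x-style `tika-config.xml` wiring a filesystem-to-filesystem pipeline looked roughly like the sketch below. Element and class names are recalled from the 2.x pipes documentation, not taken from this commit, and may differ from the new JSON-based `TikaLoader` format these pages will document, so treat this purely as an illustration of the moving parts (fetcher, emitter, iterator, async pool):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <!-- Fetcher: where raw documents are read from -->
  <fetchers>
    <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
      <name>fsf</name>
      <basePath>/data/input</basePath>
    </fetcher>
  </fetchers>
  <!-- Emitter: where parsed output is written -->
  <emitters>
    <emitter class="org.apache.tika.pipes.emitter.fs.FileSystemEmitter">
      <name>fse</name>
      <basePath>/data/output</basePath>
    </emitter>
  </emitters>
  <!-- Iterator: enumerates the documents to process -->
  <pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
    <fetcherName>fsf</fetcherName>
    <emitterName>fse</emitterName>
    <basePath>/data/input</basePath>
  </pipesIterator>
  <!-- Async pool: forked-JVM clients, per-parse timeout, JVM args -->
  <async>
    <numClients>4</numClients>
    <timeoutMillis>60000</timeoutMillis>
    <forkedJvmArgs>
      <arg>-Xmx1g</arg>
    </forkedJvmArgs>
  </async>
</properties>
```

Each forked client fetches a document, parses it under the configured timeout and heap limit, and emits the result; a hung or crashing parse kills only that forked JVM, which the pool then restarts.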
