This is an automated email from the ASF dual-hosted git repository.

tallison pushed a commit to branch update-pipes-docs
in repository https://gitbox.apache.org/repos/asf/tika.git
commit d720e7dde92025a13e36e52752203c19e955c79d
Author: tallison <[email protected]>
AuthorDate: Thu Apr 9 21:05:10 2026 -0400

    update pipes docs
---
 docs/modules/ROOT/nav.adoc                         |  6 +++
 docs/modules/ROOT/pages/pipes/index.adoc           | 40 +++++++++++------
 .../ROOT/pages/using-tika/java-api/index.adoc      | 51 +++++++++++++++++++---
 3 files changed, 78 insertions(+), 19 deletions(-)

diff --git a/docs/modules/ROOT/nav.adoc b/docs/modules/ROOT/nav.adoc
index 1702591425..a9f4a6e951 100644
--- a/docs/modules/ROOT/nav.adoc
+++ b/docs/modules/ROOT/nav.adoc
@@ -21,6 +21,12 @@
 ** xref:using-tika/cli/index.adoc[Command Line]
 ** xref:using-tika/grpc/index.adoc[gRPC]
 * xref:pipes/index.adoc[Pipes]
+** xref:pipes/getting-started.adoc[Getting Started]
+** xref:pipes/fetchers.adoc[Fetchers]
+** xref:pipes/emitters.adoc[Emitters]
+** xref:pipes/iterators.adoc[Iterators]
+** xref:pipes/reporters.adoc[Reporters]
+** xref:pipes/configuration.adoc[Pipeline Configuration]
 ** xref:pipes/parse-modes.adoc[Parse Modes]
 ** xref:pipes/unpack-config.adoc[Extracting Embedded Bytes]
 ** xref:pipes/timeouts.adoc[Timeouts]
diff --git a/docs/modules/ROOT/pages/pipes/index.adoc b/docs/modules/ROOT/pages/pipes/index.adoc
index a6fea020ae..796f9d7f1f 100644
--- a/docs/modules/ROOT/pages/pipes/index.adoc
+++ b/docs/modules/ROOT/pages/pipes/index.adoc
@@ -21,24 +21,36 @@
 This section covers Tika Pipes for scalable, fault-tolerant document processing.
 
 == Overview
 
-Tika Pipes provides a framework for processing large volumes of documents with:
+Tika Pipes provides a framework for fault-tolerant, scalable document processing.
+Each document is parsed in a forked JVM with configurable timeouts and memory limits,
+so a single malformed file cannot crash or hang your application.
 
-* **Fetchers** - Retrieve documents from various sources (filesystem, S3, HTTP, etc.)
-* **Emitters** - Send parsed results to various destinations (filesystem, OpenSearch, ES-compatible, Solr, etc.)
-* **Pipelines** - Configure processing workflows
+While Tika Pipes has a programmatic Java API, it is best used through:
 
-== Topics
+* xref:using-tika/cli/index.adoc[tika-app] -- batch processing from the command line
+* xref:using-tika/server/index.adoc[tika-server] -- REST API with pipes-based robustness built in
+* xref:using-tika/grpc/index.adoc[tika-grpc] -- gRPC API with pipes-based robustness built in
+
+See xref:advanced/robustness.adoc[Robustness] for details on how Tika Pipes protects
+against problematic files.
+
+=== Key Components
 
-* xref:pipes/parse-modes.adoc[Parse Modes] - Control how documents are parsed and emitted (`RMETA`, `CONCATENATE`, `CONTENT_ONLY`, `UNPACK`)
-* xref:pipes/unpack-config.adoc[Extracting Embedded Bytes] - Extract raw bytes from embedded documents using `ParseMode.UNPACK`
-* xref:pipes/timeouts.adoc[Timeouts] - Two-tier timeout system for handling long-running and hung parsers
+* **Fetchers** -- retrieve documents from various sources (filesystem, S3, HTTP, etc.)
+* **Emitters** -- send parsed results to various destinations (filesystem, OpenSearch, ES-compatible, Solr, etc.)
+* **Pipelines** -- configure processing workflows
+
+== Topics
 
-// Add links to specific topics as they are created
-// * link:getting-started.html[Getting Started]
-// * link:fetchers.html[Fetchers]
-// * link:emitters.html[Emitters]
-// * link:configuration.html[Configuration]
-// * link:async.html[Async Processing]
+* xref:pipes/getting-started.adoc[Getting Started] -- complete working example with tika-app
+* xref:pipes/fetchers.adoc[Fetchers] -- all available document sources (filesystem, S3, HTTP, GCS, Azure, etc.)
+* xref:pipes/emitters.adoc[Emitters] -- all available output destinations (filesystem, ES, OpenSearch, Solr, S3, Kafka, etc.)
+* xref:pipes/iterators.adoc[Iterators] -- document enumeration (directory walk, S3 listing, CSV, JDBC, Kafka, etc.)
+* xref:pipes/reporters.adoc[Reporters] -- track per-document processing status
+* xref:pipes/configuration.adoc[Pipeline Configuration] -- numClients, timeouts, JVM args, parse modes, emit batching
+* xref:pipes/parse-modes.adoc[Parse Modes] -- control how documents are parsed and emitted (`RMETA`, `CONCATENATE`, `CONTENT_ONLY`, `UNPACK`)
+* xref:pipes/unpack-config.adoc[Extracting Embedded Bytes] -- extract raw bytes from embedded documents
+* xref:pipes/timeouts.adoc[Timeouts] -- two-tier timeout system for handling long-running and hung parsers
 
 == Emitters
diff --git a/docs/modules/ROOT/pages/using-tika/java-api/index.adoc b/docs/modules/ROOT/pages/using-tika/java-api/index.adoc
index 4853446d50..22844404a0 100644
--- a/docs/modules/ROOT/pages/using-tika/java-api/index.adoc
+++ b/docs/modules/ROOT/pages/using-tika/java-api/index.adoc
@@ -24,9 +24,49 @@ This section covers using Apache Tika programmatically in your Java applications
 Tika can be embedded directly into your Java applications as a library. This gives
 you full control over parsing, detection, and configuration.
 
-However, for most use cases we recommend using xref:using-tika/server/index.adoc[tika-server]
-or xref:using-tika/grpc/index.adoc[tika-grpc] instead. See
-xref:using-tika/java-api/getting-started.adoc[Getting Started] for guidance on choosing the right approach.
+IMPORTANT: Some file formats can trigger excessive memory use, infinite loops, or JVM
+crashes in the underlying parsing libraries. For production systems processing untrusted
+files, use xref:pipes/index.adoc[Tika Pipes], which runs each parse in a forked JVM with
+timeouts and memory limits. Alternatively, xref:using-tika/server/index.adoc[tika-server]
+and xref:using-tika/grpc/index.adoc[tika-grpc] provide the same robustness as a service.
+See xref:advanced/robustness.adoc[Robustness] for details.
+
+== Dependencies
+
+Add the following to your `pom.xml`:
+
+[source,xml,subs=attributes+]
+----
+<dependency>
+  <groupId>org.apache.tika</groupId>
+  <artifactId>tika-parsers-standard-package</artifactId>
+  <version>{tika-version}</version>
+</dependency>
+----
+
+This pulls in `tika-core` and all standard parsers (PDF, Office, HTML, etc.).
+
+If you only need detection (no parsing) or want to select parsers individually:
+
+[source,xml,subs=attributes+]
+----
+<dependency>
+  <groupId>org.apache.tika</groupId>
+  <artifactId>tika-core</artifactId>
+  <version>{tika-version}</version>
+</dependency>
+----
+
+To use `TikaLoader` for JSON-based configuration, also add:
+
+[source,xml,subs=attributes+]
+----
+<dependency>
+  <groupId>org.apache.tika</groupId>
+  <artifactId>tika-serialization</artifactId>
+  <version>{tika-version}</version>
+</dependency>
+----
 
 == Parsers
@@ -165,14 +205,15 @@ container detection becomes available.
 [source,java]
 ----
-TikaConfig tika = new TikaConfig();
+TikaLoader loader = TikaLoader.loadDefault();
+Detector detector = loader.loadDetectors();
 ParseContext parseContext = new ParseContext();
 for (Path p : myListOfPaths) {
     Metadata metadata = new Metadata();
     try (TikaInputStream stream = TikaInputStream.get(p, metadata)) {
-        MediaType mimetype = tika.getDetector().detect(stream, metadata, parseContext);
+        MediaType mimetype = detector.detect(stream, metadata, parseContext);
         System.out.println("File " + p + " is " + mimetype);
     }
 }
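Reviewer note: the new Pipeline Configuration page references knobs such as numClients, timeouts, and forked-JVM args. For context, a Tika 2.x-style `tika-config.xml` wiring a filesystem-to-filesystem pipeline looked roughly like the sketch below. Element and class names are recalled from the 2.x pipes documentation, not taken from this commit, and may differ from the new JSON-based `TikaLoader` format these pages will document, so treat this purely as an illustration of the moving parts (fetcher, emitter, iterator, async pool):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <!-- Fetcher: where raw documents are read from -->
  <fetchers>
    <fetcher class="org.apache.tika.pipes.fetcher.fs.FileSystemFetcher">
      <name>fsf</name>
      <basePath>/data/input</basePath>
    </fetcher>
  </fetchers>
  <!-- Emitter: where parsed output is written -->
  <emitters>
    <emitter class="org.apache.tika.pipes.emitter.fs.FileSystemEmitter">
      <name>fse</name>
      <basePath>/data/output</basePath>
    </emitter>
  </emitters>
  <!-- Iterator: enumerates the documents to process -->
  <pipesIterator class="org.apache.tika.pipes.pipesiterator.fs.FileSystemPipesIterator">
    <fetcherName>fsf</fetcherName>
    <emitterName>fse</emitterName>
    <basePath>/data/input</basePath>
  </pipesIterator>
  <!-- Async pool: forked-JVM clients, per-parse timeout, JVM args -->
  <async>
    <numClients>4</numClients>
    <timeoutMillis>60000</timeoutMillis>
    <forkedJvmArgs>
      <arg>-Xmx1g</arg>
    </forkedJvmArgs>
  </async>
</properties>
```

Each forked client fetches a document, parses it under the configured timeout and heap limit, and emits the result; a hung or crashing parse kills only that forked JVM, which the pool then restarts.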
