(tika) branch main updated: TIKA-4717 -- update/publish initial 4.0.0-SNAPSHOT docs (#2756)

tallison Thu, 09 Apr 2026 09:55:19 -0700

This is an automated email from the ASF dual-hosted git repository.

tallison pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git



The following commit(s) were added to refs/heads/main by this push:
     new 0a0b7e718d TIKA-4717 -- update/publish initial 4.0.0-SNAPSHOT docs (#2756)
0a0b7e718d is described below

commit 0a0b7e718dc4bae7b8bbd05dad8e3165cc8fbe52
Author: Tim Allison <[email protected]>
AuthorDate: Thu Apr 9 12:54:53 2026 -0400

    TIKA-4717 -- update/publish initial 4.0.0-SNAPSHOT docs (#2756)
---
 docs/build-docs.sh                                 | 53 ++++++++++++++++
 .../advanced/flores-eval-20260320.txt              |  0
 .../pages/advanced/generative-language-model.adoc  |  4 +-
 .../advanced/integration-testing/tika-server.adoc  |  4 +-
 .../pages/advanced/language-detection-build.adoc   |  2 +-
 .../ROOT/pages/advanced/language-detection.adoc    |  5 +-
 docs/modules/ROOT/pages/developers/index.adoc      |  2 +-
 docs/modules/ROOT/pages/index.adoc                 |  4 ++
 docs/modules/ROOT/pages/maintainers/site.adoc      | 52 +++++++++++++---
 .../pages/migration-to-4x/migrating-to-4x.adoc     | 72 ++++++++++++++++++++--
 docs/modules/ROOT/pages/pipes/unpack-config.adoc   |  2 +-
 11 files changed, 175 insertions(+), 25 deletions(-)

diff --git a/docs/build-docs.sh b/docs/build-docs.sh
new file mode 100755
index 0000000000..030ca1199d
--- /dev/null
+++ b/docs/build-docs.sh
@@ -0,0 +1,53 @@
+#!/bin/bash
+# Builds the Antora docs site with the current git commit stamped on the home page.
+# Usage: ./build-docs.sh
+# Output: target/site/
+#
+# To publish to the tika-site SVN repo:
+#   ./build-docs.sh --publish /path/to/tika-site/publish
+
+set -euo pipefail
+cd "$(dirname "$0")"
+
+COMMIT=$(git rev-parse --short HEAD)
+DATE=$(date -u +%Y-%m-%d)
+
+# Inject commit into playbook, build, restore
+sed -i "/tika-stable-version/a\\    git-commit: '${COMMIT} (${DATE})'" antora-playbook.yml
+trap 'git checkout antora-playbook.yml' EXIT
+
+# Pass remaining args to Maven (filter out our --publish flag)
+PUBLISH_DIR=""
+MVN_ARGS=()
+while [[ $# -gt 0 ]]; do
+    case $1 in
+        --publish)
+            PUBLISH_DIR="$2"
+            shift 2
+            ;;
+        *)
+            MVN_ARGS+=("$1")
+            shift
+            ;;
+    esac
+done
+
+../mvnw antora:antora "${MVN_ARGS[@]}"
+
+echo "Site built at: target/site/"
+echo "Commit: ${COMMIT} (${DATE})"
+
+if [[ -n "${PUBLISH_DIR}" ]]; then
+    # Flatten: skip the 'tika/' component directory so URLs are /docs/4.0.0-SNAPSHOT/
+    # Copy UI assets one level above docs/ since HTML uses ../../_/ relative paths
+    DOCS_DIR="${PUBLISH_DIR}/docs"
+    mkdir -p "${DOCS_DIR}"
+    cp -r target/site/tika/* "${DOCS_DIR}/"
+    cp -r target/site/_/ "${PUBLISH_DIR}/_/"
+    # Fix the root redirect to match flattened layout
+    sed 's|tika/||g' target/site/index.html > "${DOCS_DIR}/index.html"
+    sed 's|/docs/tika/|/docs/|g' target/site/sitemap.xml > "${DOCS_DIR}/sitemap.xml"
+    cp target/site/404.html "${DOCS_DIR}/"
+    cp target/site/search-index.js "${DOCS_DIR}/"
+    echo "Published to: ${DOCS_DIR}/"
+fi
diff --git a/docs/modules/ROOT/pages/advanced/flores-eval-20260320.txt b/docs/modules/ROOT/attachments/advanced/flores-eval-20260320.txt
similarity index 100%
rename from docs/modules/ROOT/pages/advanced/flores-eval-20260320.txt
rename to docs/modules/ROOT/attachments/advanced/flores-eval-20260320.txt
diff --git a/docs/modules/ROOT/pages/advanced/generative-language-model.adoc b/docs/modules/ROOT/pages/advanced/generative-language-model.adoc
index 7c2b5fa0ce..8d1b0ebb59 100644
--- a/docs/modules/ROOT/pages/advanced/generative-language-model.adoc
+++ b/docs/modules/ROOT/pages/advanced/generative-language-model.adoc
@@ -221,8 +221,8 @@ Training performs two passes:
 2. **Calibration pass** — re-scores training sentences to compute per-language
    μ and σ (Welford's online algorithm), stored for z-score computation at runtime.
 
-The corpus can be in Wikipedia dump format (`corpusDir/{code}/sentences.txt`)
-or flat format (`corpusDir/{code}` with one sentence per line).
+The corpus can be in Wikipedia dump format (`corpusDir/\{code}/sentences.txt`)
+or flat format (`corpusDir/\{code}` with one sentence per line).
 Use `--max-per-lang N` (default 500,000) to cap sentences per language.
 
 == Evaluation Tools
diff --git a/docs/modules/ROOT/pages/advanced/integration-testing/tika-server.adoc b/docs/modules/ROOT/pages/advanced/integration-testing/tika-server.adoc
index 85bca5f1fa..b536701ebd 100644
--- a/docs/modules/ROOT/pages/advanced/integration-testing/tika-server.adoc
+++ b/docs/modules/ROOT/pages/advanced/integration-testing/tika-server.adoc
@@ -111,7 +111,7 @@ curl -s -X PUT -H "Accept: application/json" -T testPDF.pdf http://localhost:999
 
 *Expected:* JSON object with metadata only (no content).
 
-=== Test 8: PUT /meta/{field}
+=== Test 8: PUT /meta/\{field}
 
 [source,bash]
 ----
@@ -379,7 +379,7 @@ The following endpoints were tested and verified working:
 |`/tika/xml` |PUT |PASS
 |`/tika/json` |PUT |PASS
 |`/meta` |PUT |PASS
-|`/meta/{field}` |PUT |PASS
+|`/meta/\{field}` |PUT |PASS
 |`/rmeta` |PUT |PASS
 |`/rmeta/text` |PUT |PASS
 |`/language/stream` |PUT |PASS
diff --git a/docs/modules/ROOT/pages/advanced/language-detection-build.adoc b/docs/modules/ROOT/pages/advanced/language-detection-build.adoc
index 1fdd8b3a8a..1b999784bc 100644
--- a/docs/modules/ROOT/pages/advanced/language-detection-build.adoc
+++ b/docs/modules/ROOT/pages/advanced/language-detection-build.adoc
@@ -363,7 +363,7 @@ new file, and remove the old binary.
 
 Results on the https://github.com/facebookresearch/flores[FLORES-200] dev set
 (204 test languages, 997 sentences each). All scores are macro-averaged F1.
-Raw eval output: xref:advanced/flores-eval-20260320.txt[flores-eval-20260320.txt].
+Raw eval output: link:{attachmentsdir}/advanced/flores-eval-20260320.txt[flores-eval-20260320.txt].
 
 ==== Coverage-adjusted accuracy (each detector on its own supported languages)
 
diff --git a/docs/modules/ROOT/pages/advanced/language-detection.adoc b/docs/modules/ROOT/pages/advanced/language-detection.adoc
index 2ed1cd2d9a..9fdbe3c551 100644
--- a/docs/modules/ROOT/pages/advanced/language-detection.adoc
+++ b/docs/modules/ROOT/pages/advanced/language-detection.adoc
@@ -287,7 +287,7 @@ numeric check, keeping the language detection hot path fast.
 
 The language detector draws on several well-established techniques.
 
-[bibliography]
+[bibliography%unordered]
 - [[[cavnar1994]]] W. B. Cavnar and J. M. Trenkle,
   "N-Gram-Based Text Categorization,"
   in _Proceedings of the Third Annual Symposium on Document Analysis and
@@ -344,8 +344,7 @@ The language detector draws on several well-established techniques.
   Current models (v7+) use Wikipedia dumps as the primary corpus. +
   https://aclanthology.org/L12-1154/
 
-- [[[jacob2018]]] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard,
-  H. Adam, and D. Kalenichenko,
+- [[[jacob2018]]] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko,
   "Quantization and Training of Neural Networks for Efficient
   Integer-Arithmetic-Only Inference,"
   in _Proceedings of the IEEE Conference on Computer Vision and Pattern
diff --git a/docs/modules/ROOT/pages/developers/index.adoc b/docs/modules/ROOT/pages/developers/index.adoc
index 08e56a7065..e72c12747b 100644
--- a/docs/modules/ROOT/pages/developers/index.adoc
+++ b/docs/modules/ROOT/pages/developers/index.adoc
@@ -20,7 +20,7 @@ with custom parsers, detectors, and other components.
 
 == Topics
 
-* xref:serialization.adoc[Serialization and Configuration] - JSON configuration,
+* xref:developers/serialization.adoc[Serialization and Configuration] - JSON configuration,
   @TikaComponent annotation, and creating custom components
 
 == Coming Soon
diff --git a/docs/modules/ROOT/pages/index.adoc b/docs/modules/ROOT/pages/index.adoc
index ee46a9d07e..421d524ce0 100644
--- a/docs/modules/ROOT/pages/index.adoc
+++ b/docs/modules/ROOT/pages/index.adoc
@@ -41,3 +41,7 @@ xref:using-tika/index.adoc[Using Tika] to choose your integration method.
 
 Apache Tika is an Apache Software Foundation project, formerly a subproject of Apache Lucene.
 
+ifdef::git-commit[]
+[.small]#Built from commit: `{git-commit}`#
+endif::[]
+
diff --git a/docs/modules/ROOT/pages/maintainers/site.adoc b/docs/modules/ROOT/pages/maintainers/site.adoc
index 398c7611fc..2a86d9231d 100644
--- a/docs/modules/ROOT/pages/maintainers/site.adoc
+++ b/docs/modules/ROOT/pages/maintainers/site.adoc
@@ -41,6 +41,18 @@ mvn antora:antora
 
 The generated site will be at `docs/target/site/`.
 
+To stamp the build with the current commit hash (shown on the home page),
+add `git-commit` to the attributes in `antora-playbook.yml`:
+
+[source,yaml]
+----
+asciidoc:
+  attributes:
+    git-commit: 'abc1234'
+----
+
+Or pass it on the command line when you have a playbook that supports CLI attributes.
+
 === Previewing the Site
 
 **Option 1: Python HTTP server (recommended)**
@@ -99,7 +111,29 @@ Documentation versions are managed through Git branches with the `docs/` prefix.
 
 The playbook (`antora-playbook.yml`) is configured to build all `docs/*` branches automatically.
 
-=== Publishing a New Release
+=== Publishing to the Site
+
+Use `build-docs.sh` with the `--publish` flag to build and copy to the site SVN checkout:
+
+[source,bash]
+----
+cd docs
+./build-docs.sh --publish /path/to/tika-site/publish
+
+# Then in the SVN checkout:
+cd /path/to/tika-site
+svn add publish/docs publish/_ --force
+svn commit -m "Publish 4.0.0-SNAPSHOT docs"
+----
+
+This builds the Antora site, stamps the git commit on the home page, and copies
+the output to the site with the correct directory layout:
+
+* `publish/docs/4.0.0-SNAPSHOT/` -- the documentation pages
+* `publish/_/` -- CSS, JS, fonts (shared across versions)
+* `publish/docs/index.html` -- redirect to latest version
+
+=== Publishing a Release
 
 When releasing a new version (e.g., 4.0.0):
 
@@ -116,14 +150,13 @@ sed -i "s/4.0.0-SNAPSHOT/4.0.0/" docs/antora.yml
 git commit -am "Set docs version to 4.0.0"
 git push origin docs/4.0.0
 
-# 4. Build the site
+# 4. Build and publish
 cd docs
-mvn antora:antora
+./build-docs.sh --publish /path/to/tika-site/publish
 
-# 5. Publish to SVN
-cp -r target/site/* ~/tika-site/4.x/
-cd ~/tika-site
-svn add 4.x --force
+# 5. Commit to SVN
+cd /path/to/tika-site
+svn add publish/docs publish/_ --force
 svn commit -m "Publish 4.0.0 docs"
 ----
 
@@ -145,9 +178,8 @@ git push origin docs/4.0.0
 
 # 4. Rebuild and republish
 cd docs
-mvn antora:antora
-cp -r target/site/* ~/tika-site/4.x/
-cd ~/tika-site
+./build-docs.sh --publish /path/to/tika-site/publish
+cd /path/to/tika-site
 svn commit -m "Update 4.0.0 docs"
 ----
 
diff --git a/docs/modules/ROOT/pages/migration-to-4x/migrating-to-4x.adoc b/docs/modules/ROOT/pages/migration-to-4x/migrating-to-4x.adoc
index 5c963f4809..34ef91d778 100644
--- a/docs/modules/ROOT/pages/migration-to-4x/migrating-to-4x.adoc
+++ b/docs/modules/ROOT/pages/migration-to-4x/migrating-to-4x.adoc
@@ -120,10 +120,16 @@ WARNING: The configuration options for `PDFParser` and `TesseractOCRParser` have
 significantly in 4.x. The automatic converter will migrate your parameter names, but you
 should review the updated documentation to ensure your configuration is optimal.
 
-See:
+See the xref:configuration/index.adoc[Configuration] section for full details, including:
 
-* xref:configuration/parsers/pdf-parser.adoc[PDFParser Configuration] - Updated options for PDF parsing
-* xref:configuration/parsers/tesseract-ocr-parser.adoc[TesseractOCRParser Configuration] - Updated OCR options
+* xref:configuration/parsers/pdf-parser.adoc[PDFParser Configuration]
+* xref:configuration/parsers/tesseract-ocr-parser.adoc[TesseractOCRParser Configuration]
+* xref:configuration/parsers/tess4j-parser.adoc[Tess4J OCR (In-Process) Configuration]
+* xref:configuration/parsers/vlm-parsers.adoc[VLM Parsers (Claude, Gemini, OpenAI)]
+* xref:configuration/parsers/external-parser.adoc[External Parser (ffmpeg, exiftool, etc.)]
+
+For the general serialization model and how JSON configuration works, see
+xref:developers/serialization.adoc[Serialization and Configuration].
 
 === Full Configuration Example
 
@@ -147,8 +153,64 @@ a full table of changes and code migration examples.
 
 == API Changes
 
-// TODO: Document API changes
+=== TikaConfig replaced by TikaLoader
+
+`TikaConfig` has been removed. Use `TikaLoader` from `tika-serialization` instead.
+
+**3.x:**
+[source,java]
+----
+TikaConfig config = new TikaConfig(getClass().getClassLoader());
+Parser parser = config.getParser();
+Detector detector = config.getDetector();
+AutoDetectParser autoDetect = new AutoDetectParser(config);
+----
+
+**4.x:**
+[source,java]
+----
+// Default configuration (SPI-discovered components)
+TikaLoader loader = TikaLoader.loadDefault(getClass().getClassLoader());
+
+// Or from a JSON config file
+TikaLoader loader = TikaLoader.load(Path.of("tika-config.json"));
+
+// Access components
+Parser parser = loader.loadParsers();
+Detector detector = loader.loadDetectors();
+Parser autoDetect = loader.loadAutoDetectParser();
+ParseContext context = loader.loadParseContext();
+----
+
+NOTE: `TikaLoader` is in the `tika-serialization` module. Add `tika-serialization`
+as a dependency if you were previously only depending on `tika-core`.
+See xref:developers/serialization.adoc[Serialization and Configuration] for
+the full `TikaLoader` API.
+
+For simple use cases, the `Tika` facade and `DefaultParser` still work without
+`TikaLoader`:
+
+[source,java]
+----
+// Simple facade (unchanged from 3.x)
+Tika tika = new Tika();
+String text = tika.parseToString(file);
+
+// Direct parser use (unchanged from 3.x)
+Parser parser = new DefaultParser();
+----
+
+=== ExternalParser
+
+The legacy `ExternalParser` and `CompositeExternalParser` have been removed.
+External parsers must now be explicitly configured via JSON. See
+xref:configuration/parsers/external-parser.adoc[External Parser Configuration]
+for details.
 
 == Deprecations and Removals
 
-// TODO: Document deprecated and removed features
+* `TikaConfig` -- replaced by `TikaLoader`
+* `CompositeExternalParser` -- external parsers now require explicit JSON configuration
+* `ExternalParsersFactory` and XML-based external parser auto-discovery
+* DOM-based OOXML extractors (`XWPFWordExtractorDecorator`, `XSLFPowerPointExtractorDecorator`)
+  -- SAX-based extractors are now the only implementation
diff --git a/docs/modules/ROOT/pages/pipes/unpack-config.adoc b/docs/modules/ROOT/pages/pipes/unpack-config.adoc
index 45f15cdb1c..f3bc1fe5f4 100644
--- a/docs/modules/ROOT/pages/pipes/unpack-config.adoc
+++ b/docs/modules/ROOT/pages/pipes/unpack-config.adoc
@@ -175,7 +175,7 @@ This limits extraction to 100MB total.
 
 == Key Base Strategies
 
-`DEFAULT`:: Output key is `{containerKey}-{embeddedIdPrefix}{id}{suffix}`
+`DEFAULT`:: Output key is `\{containerKey}-\{embeddedIdPrefix}\{id}\{suffix}`
 `CUSTOM`:: Output key uses `emitKeyBase` as the prefix.
 
 == Safety Limits
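
Editor's sketch (not part of the commit): the flag-filtering loop in docs/build-docs.sh can be exercised on its own, without running Maven. The block below is a minimal sketch that assumes the same two variables as the script, PUBLISH_DIR and MVN_ARGS. The function name split_args, the sample arguments, and the guard for a missing --publish value are illustrative additions and do not appear in the commit.

```shell
#!/bin/bash
# Standalone sketch of the argument-splitting loop from build-docs.sh.
# split_args and the missing-value guard are hypothetical additions.
set -euo pipefail

PUBLISH_DIR=""
MVN_ARGS=()

split_args() {
    PUBLISH_DIR=""
    MVN_ARGS=()
    while [[ $# -gt 0 ]]; do
        case $1 in
            --publish)
                # Without this guard, a trailing --publish makes "$2"
                # unbound and the script aborts under `set -u`.
                if [[ $# -lt 2 ]]; then
                    echo "error: --publish requires a directory" >&2
                    return 1
                fi
                PUBLISH_DIR="$2"
                shift 2
                ;;
            *)
                # Everything else is collected for Maven unchanged.
                MVN_ARGS+=("$1")
                shift
                ;;
        esac
    done
}

# Sample invocation with made-up arguments.
split_args -q --publish /tmp/tika-site/publish -DskipTests
echo "publish dir: ${PUBLISH_DIR}"
echo "maven args: ${MVN_ARGS[*]}"
```

Keeping the Maven arguments in an array rather than a flat string preserves arguments that contain spaces when they are later expanded as "${MVN_ARGS[@]}". One caveat worth checking locally: on bash releases older than 4.4, expanding an empty array under `set -u` is reported as an unbound variable, so the no-argument case may behave differently on older shells.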
