This is an automated email from the ASF dual-hosted git repository.
tallison pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/tika.git
The following commit(s) were added to refs/heads/main by this push:
new 0a0b7e718d TIKA-4717 -- update/publish initial 4.0.0-SNAPSHOT docs (#2756)
0a0b7e718d is described below
commit 0a0b7e718dc4bae7b8bbd05dad8e3165cc8fbe52
Author: Tim Allison <[email protected]>
AuthorDate: Thu Apr 9 12:54:53 2026 -0400
TIKA-4717 -- update/publish initial 4.0.0-SNAPSHOT docs (#2756)
---
docs/build-docs.sh | 53 ++++++++++++++++
.../advanced/flores-eval-20260320.txt | 0
.../pages/advanced/generative-language-model.adoc | 4 +-
.../advanced/integration-testing/tika-server.adoc | 4 +-
.../pages/advanced/language-detection-build.adoc | 2 +-
.../ROOT/pages/advanced/language-detection.adoc | 5 +-
docs/modules/ROOT/pages/developers/index.adoc | 2 +-
docs/modules/ROOT/pages/index.adoc | 4 ++
docs/modules/ROOT/pages/maintainers/site.adoc | 52 +++++++++++++---
.../pages/migration-to-4x/migrating-to-4x.adoc | 72 ++++++++++++++++++++--
docs/modules/ROOT/pages/pipes/unpack-config.adoc | 2 +-
11 files changed, 175 insertions(+), 25 deletions(-)
diff --git a/docs/build-docs.sh b/docs/build-docs.sh
new file mode 100755
index 0000000000..030ca1199d
--- /dev/null
+++ b/docs/build-docs.sh
@@ -0,0 +1,53 @@
+#!/bin/bash
+# Builds the Antora docs site with the current git commit stamped on the home page.
+# Usage: ./build-docs.sh
+# Output: target/site/
+#
+# To publish to the tika-site SVN repo:
+# ./build-docs.sh --publish /path/to/tika-site/publish
+
+set -euo pipefail
+cd "$(dirname "$0")"
+
+COMMIT=$(git rev-parse --short HEAD)
+DATE=$(date -u +%Y-%m-%d)
+
+# Inject commit into playbook, build, restore
+sed -i "/tika-stable-version/a\\ git-commit: '${COMMIT} (${DATE})'" antora-playbook.yml
+trap 'git checkout antora-playbook.yml' EXIT
+
+# Pass remaining args to Maven (filter out our --publish flag)
+PUBLISH_DIR=""
+MVN_ARGS=()
+while [[ $# -gt 0 ]]; do
+ case $1 in
+ --publish)
+ PUBLISH_DIR="$2"
+ shift 2
+ ;;
+ *)
+ MVN_ARGS+=("$1")
+ shift
+ ;;
+ esac
+done
+
+../mvnw antora:antora "${MVN_ARGS[@]}"
+
+echo "Site built at: target/site/"
+echo "Commit: ${COMMIT} (${DATE})"
+
+if [[ -n "${PUBLISH_DIR}" ]]; then
+ # Flatten: skip the 'tika/' component directory so URLs are /docs/4.0.0-SNAPSHOT/
+ # Copy UI assets one level above docs/ since HTML uses ../../_/ relative paths
+ DOCS_DIR="${PUBLISH_DIR}/docs"
+ mkdir -p "${DOCS_DIR}"
+ cp -r target/site/tika/* "${DOCS_DIR}/"
+ cp -r target/site/_/ "${PUBLISH_DIR}/_/"
+ # Fix the root redirect to match flattened layout
+ sed 's|tika/||g' target/site/index.html > "${DOCS_DIR}/index.html"
+ sed 's|/docs/tika/|/docs/|g' target/site/sitemap.xml > "${DOCS_DIR}/sitemap.xml"
+ cp target/site/404.html "${DOCS_DIR}/"
+ cp target/site/search-index.js "${DOCS_DIR}/"
+ echo "Published to: ${DOCS_DIR}/"
+fi
diff --git a/docs/modules/ROOT/pages/advanced/flores-eval-20260320.txt b/docs/modules/ROOT/attachments/advanced/flores-eval-20260320.txt
similarity index 100%
rename from docs/modules/ROOT/pages/advanced/flores-eval-20260320.txt
rename to docs/modules/ROOT/attachments/advanced/flores-eval-20260320.txt
diff --git a/docs/modules/ROOT/pages/advanced/generative-language-model.adoc b/docs/modules/ROOT/pages/advanced/generative-language-model.adoc
index 7c2b5fa0ce..8d1b0ebb59 100644
--- a/docs/modules/ROOT/pages/advanced/generative-language-model.adoc
+++ b/docs/modules/ROOT/pages/advanced/generative-language-model.adoc
@@ -221,8 +221,8 @@ Training performs two passes:
2. **Calibration pass** — re-scores training sentences to compute per-language
μ and σ (Welford's online algorithm), stored for z-score computation at
runtime.
-The corpus can be in Wikipedia dump format (`corpusDir/{code}/sentences.txt`)
-or flat format (`corpusDir/{code}` with one sentence per line).
+The corpus can be in Wikipedia dump format (`corpusDir/\{code}/sentences.txt`)
+or flat format (`corpusDir/\{code}` with one sentence per line).
Use `--max-per-lang N` (default 500,000) to cap sentences per language.
== Evaluation Tools
diff --git a/docs/modules/ROOT/pages/advanced/integration-testing/tika-server.adoc b/docs/modules/ROOT/pages/advanced/integration-testing/tika-server.adoc
index 85bca5f1fa..b536701ebd 100644
--- a/docs/modules/ROOT/pages/advanced/integration-testing/tika-server.adoc
+++ b/docs/modules/ROOT/pages/advanced/integration-testing/tika-server.adoc
@@ -111,7 +111,7 @@ curl -s -X PUT -H "Accept: application/json" -T testPDF.pdf http://localhost:999
*Expected:* JSON object with metadata only (no content).
-=== Test 8: PUT /meta/{field}
+=== Test 8: PUT /meta/\{field}
[source,bash]
----
@@ -379,7 +379,7 @@ The following endpoints were tested and verified working:
|`/tika/xml` |PUT |PASS
|`/tika/json` |PUT |PASS
|`/meta` |PUT |PASS
-|`/meta/{field}` |PUT |PASS
+|`/meta/\{field}` |PUT |PASS
|`/rmeta` |PUT |PASS
|`/rmeta/text` |PUT |PASS
|`/language/stream` |PUT |PASS
diff --git a/docs/modules/ROOT/pages/advanced/language-detection-build.adoc b/docs/modules/ROOT/pages/advanced/language-detection-build.adoc
index 1fdd8b3a8a..1b999784bc 100644
--- a/docs/modules/ROOT/pages/advanced/language-detection-build.adoc
+++ b/docs/modules/ROOT/pages/advanced/language-detection-build.adoc
@@ -363,7 +363,7 @@ new file, and remove the old binary.
Results on the https://github.com/facebookresearch/flores[FLORES-200] dev set
(204 test languages, 997 sentences each). All scores are macro-averaged F1.
-Raw eval output: xref:advanced/flores-eval-20260320.txt[flores-eval-20260320.txt].
+Raw eval output: link:{attachmentsdir}/advanced/flores-eval-20260320.txt[flores-eval-20260320.txt].
==== Coverage-adjusted accuracy (each detector on its own supported languages)
diff --git a/docs/modules/ROOT/pages/advanced/language-detection.adoc b/docs/modules/ROOT/pages/advanced/language-detection.adoc
index 2ed1cd2d9a..9fdbe3c551 100644
--- a/docs/modules/ROOT/pages/advanced/language-detection.adoc
+++ b/docs/modules/ROOT/pages/advanced/language-detection.adoc
@@ -287,7 +287,7 @@ numeric check, keeping the language detection hot path fast.
The language detector draws on several well-established techniques.
-[bibliography]
+[bibliography%unordered]
- [[[cavnar1994]]] W. B. Cavnar and J. M. Trenkle,
"N-Gram-Based Text Categorization,"
in _Proceedings of the Third Annual Symposium on Document Analysis and
@@ -344,8 +344,7 @@ The language detector draws on several well-established techniques.
Current models (v7+) use Wikipedia dumps as the primary corpus. +
https://aclanthology.org/L12-1154/
-- [[[jacob2018]]] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard,
- H. Adam, and D. Kalenichenko,
+- [[[jacob2018]]] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko,
"Quantization and Training of Neural Networks for Efficient
Integer-Arithmetic-Only Inference,"
in _Proceedings of the IEEE Conference on Computer Vision and Pattern
diff --git a/docs/modules/ROOT/pages/developers/index.adoc b/docs/modules/ROOT/pages/developers/index.adoc
index 08e56a7065..e72c12747b 100644
--- a/docs/modules/ROOT/pages/developers/index.adoc
+++ b/docs/modules/ROOT/pages/developers/index.adoc
@@ -20,7 +20,7 @@ with custom parsers, detectors, and other components.
== Topics
-* xref:serialization.adoc[Serialization and Configuration] - JSON configuration,
+* xref:developers/serialization.adoc[Serialization and Configuration] - JSON configuration,
@TikaComponent annotation, and creating custom components
== Coming Soon
diff --git a/docs/modules/ROOT/pages/index.adoc b/docs/modules/ROOT/pages/index.adoc
index ee46a9d07e..421d524ce0 100644
--- a/docs/modules/ROOT/pages/index.adoc
+++ b/docs/modules/ROOT/pages/index.adoc
@@ -41,3 +41,7 @@ xref:using-tika/index.adoc[Using Tika] to choose your integration method.
Apache Tika is an Apache Software Foundation project, formerly a subproject of
Apache Lucene.
+ifdef::git-commit[]
+[.small]#Built from commit: `{git-commit}`#
+endif::[]
+
diff --git a/docs/modules/ROOT/pages/maintainers/site.adoc b/docs/modules/ROOT/pages/maintainers/site.adoc
index 398c7611fc..2a86d9231d 100644
--- a/docs/modules/ROOT/pages/maintainers/site.adoc
+++ b/docs/modules/ROOT/pages/maintainers/site.adoc
@@ -41,6 +41,18 @@ mvn antora:antora
The generated site will be at `docs/target/site/`.
+To stamp the build with the current commit hash (shown on the home page),
+add `git-commit` to the attributes in `antora-playbook.yml`:
+
+[source,yaml]
+----
+asciidoc:
+ attributes:
+ git-commit: 'abc1234'
+----
+
+Or pass it on the command line when you have a playbook that supports CLI attributes.
+
=== Previewing the Site
**Option 1: Python HTTP server (recommended)**
@@ -99,7 +111,29 @@ Documentation versions are managed through Git branches with the `docs/` prefix.
The playbook (`antora-playbook.yml`) is configured to build all `docs/*`
branches automatically.
-=== Publishing a New Release
+=== Publishing to the Site
+
+Use `build-docs.sh` with the `--publish` flag to build and copy to the site SVN checkout:
+
+[source,bash]
+----
+cd docs
+./build-docs.sh --publish /path/to/tika-site/publish
+
+# Then in the SVN checkout:
+cd /path/to/tika-site
+svn add publish/docs publish/_ --force
+svn commit -m "Publish 4.0.0-SNAPSHOT docs"
+----
+
+This builds the Antora site, stamps the git commit on the home page, and copies
+the output to the site with the correct directory layout:
+
+* `publish/docs/4.0.0-SNAPSHOT/` -- the documentation pages
+* `publish/_/` -- CSS, JS, fonts (shared across versions)
+* `publish/docs/index.html` -- redirect to latest version
+
+=== Publishing a Release
When releasing a new version (e.g., 4.0.0):
@@ -116,14 +150,13 @@ sed -i "s/4.0.0-SNAPSHOT/4.0.0/" docs/antora.yml
git commit -am "Set docs version to 4.0.0"
git push origin docs/4.0.0
-# 4. Build the site
+# 4. Build and publish
cd docs
-mvn antora:antora
+./build-docs.sh --publish /path/to/tika-site/publish
-# 5. Publish to SVN
-cp -r target/site/* ~/tika-site/4.x/
-cd ~/tika-site
-svn add 4.x --force
+# 5. Commit to SVN
+cd /path/to/tika-site
+svn add publish/docs publish/_ --force
svn commit -m "Publish 4.0.0 docs"
----
@@ -145,9 +178,8 @@ git push origin docs/4.0.0
# 4. Rebuild and republish
cd docs
-mvn antora:antora
-cp -r target/site/* ~/tika-site/4.x/
-cd ~/tika-site
+./build-docs.sh --publish /path/to/tika-site/publish
+cd /path/to/tika-site
svn commit -m "Update 4.0.0 docs"
----
diff --git a/docs/modules/ROOT/pages/migration-to-4x/migrating-to-4x.adoc b/docs/modules/ROOT/pages/migration-to-4x/migrating-to-4x.adoc
index 5c963f4809..34ef91d778 100644
--- a/docs/modules/ROOT/pages/migration-to-4x/migrating-to-4x.adoc
+++ b/docs/modules/ROOT/pages/migration-to-4x/migrating-to-4x.adoc
@@ -120,10 +120,16 @@ WARNING: The configuration options for `PDFParser` and `TesseractOCRParser` have
significantly in 4.x. The automatic converter will migrate your parameter names, but you
should review the updated documentation to ensure your configuration is optimal.
-See:
+See the xref:configuration/index.adoc[Configuration] section for full details, including:
-* xref:configuration/parsers/pdf-parser.adoc[PDFParser Configuration] - Updated options for PDF parsing
-* xref:configuration/parsers/tesseract-ocr-parser.adoc[TesseractOCRParser Configuration] - Updated OCR options
+* xref:configuration/parsers/pdf-parser.adoc[PDFParser Configuration]
+* xref:configuration/parsers/tesseract-ocr-parser.adoc[TesseractOCRParser Configuration]
+* xref:configuration/parsers/tess4j-parser.adoc[Tess4J OCR (In-Process) Configuration]
+* xref:configuration/parsers/vlm-parsers.adoc[VLM Parsers (Claude, Gemini, OpenAI)]
+* xref:configuration/parsers/external-parser.adoc[External Parser (ffmpeg, exiftool, etc.)]
+
+For the general serialization model and how JSON configuration works, see
+xref:developers/serialization.adoc[Serialization and Configuration].
=== Full Configuration Example
@@ -147,8 +153,64 @@ a full table of changes and code migration examples.
== API Changes
-// TODO: Document API changes
+=== TikaConfig replaced by TikaLoader
+
+`TikaConfig` has been removed. Use `TikaLoader` from `tika-serialization` instead.
+
+**3.x:**
+[source,java]
+----
+TikaConfig config = new TikaConfig(getClass().getClassLoader());
+Parser parser = config.getParser();
+Detector detector = config.getDetector();
+AutoDetectParser autoDetect = new AutoDetectParser(config);
+----
+
+**4.x:**
+[source,java]
+----
+// Default configuration (SPI-discovered components)
+TikaLoader loader = TikaLoader.loadDefault(getClass().getClassLoader());
+
+// Or from a JSON config file
+TikaLoader loader = TikaLoader.load(Path.of("tika-config.json"));
+
+// Access components
+Parser parser = loader.loadParsers();
+Detector detector = loader.loadDetectors();
+Parser autoDetect = loader.loadAutoDetectParser();
+ParseContext context = loader.loadParseContext();
+----
+
+NOTE: `TikaLoader` is in the `tika-serialization` module. Add `tika-serialization`
+as a dependency if you were previously only depending on `tika-core`.
+See xref:developers/serialization.adoc[Serialization and Configuration] for
+the full `TikaLoader` API.
+
+For simple use cases, the `Tika` facade and `DefaultParser` still work without
+`TikaLoader`:
+
+[source,java]
+----
+// Simple facade (unchanged from 3.x)
+Tika tika = new Tika();
+String text = tika.parseToString(file);
+
+// Direct parser use (unchanged from 3.x)
+Parser parser = new DefaultParser();
+----
+
+=== ExternalParser
+
+The legacy `ExternalParser` and `CompositeExternalParser` have been removed.
+External parsers must now be explicitly configured via JSON. See
+xref:configuration/parsers/external-parser.adoc[External Parser Configuration]
+for details.
== Deprecations and Removals
-// TODO: Document deprecated and removed features
+* `TikaConfig` -- replaced by `TikaLoader`
+* `CompositeExternalParser` -- external parsers now require explicit JSON configuration
+* `ExternalParsersFactory` and XML-based external parser auto-discovery
+* DOM-based OOXML extractors (`XWPFWordExtractorDecorator`, `XSLFPowerPointExtractorDecorator`)
+ -- SAX-based extractors are now the only implementation
diff --git a/docs/modules/ROOT/pages/pipes/unpack-config.adoc b/docs/modules/ROOT/pages/pipes/unpack-config.adoc
index 45f15cdb1c..f3bc1fe5f4 100644
--- a/docs/modules/ROOT/pages/pipes/unpack-config.adoc
+++ b/docs/modules/ROOT/pages/pipes/unpack-config.adoc
@@ -175,7 +175,7 @@ This limits extraction to 100MB total.
== Key Base Strategies
-`DEFAULT`:: Output key is `{containerKey}-{embeddedIdPrefix}{id}{suffix}`
+`DEFAULT`:: Output key is `\{containerKey}-\{embeddedIdPrefix}\{id}\{suffix}`
`CUSTOM`:: Output key uses `emitKeyBase` as the prefix.
== Safety Limits