This is an automated email from the ASF dual-hosted git repository.

tallison pushed a commit to branch TIKA-4717-publish-docs-to-site
in repository https://gitbox.apache.org/repos/asf/tika.git
commit fce4c25fc70ab669fecad0356487431553799d2e
Author: tallison <[email protected]>
AuthorDate: Thu Apr 9 12:48:18 2026 -0400

    TIKA-4717 -- update/publish initial 4.0.0-SNAPSHOT docs
---
 docs/build-docs.sh                                 | 53 ++++++++++++++++
 .../advanced/flores-eval-20260320.txt              |  0
 .../pages/advanced/generative-language-model.adoc  |  4 +-
 .../advanced/integration-testing/tika-server.adoc  |  4 +-
 .../pages/advanced/language-detection-build.adoc   |  2 +-
 .../ROOT/pages/advanced/language-detection.adoc    |  5 +-
 docs/modules/ROOT/pages/developers/index.adoc      |  2 +-
 docs/modules/ROOT/pages/index.adoc                 |  4 ++
 docs/modules/ROOT/pages/maintainers/site.adoc      | 52 +++++++++++++---
 .../pages/migration-to-4x/migrating-to-4x.adoc     | 72 ++++++++++++++++++++--
 docs/modules/ROOT/pages/pipes/unpack-config.adoc   |  2 +-
 11 files changed, 175 insertions(+), 25 deletions(-)

diff --git a/docs/build-docs.sh b/docs/build-docs.sh
new file mode 100755
index 0000000000..030ca1199d
--- /dev/null
+++ b/docs/build-docs.sh
@@ -0,0 +1,53 @@
+#!/bin/bash
+# Builds the Antora docs site with the current git commit stamped on the home page.
+# Usage: ./build-docs.sh
+# Output: target/site/
+#
+# To publish to the tika-site SVN repo:
+# ./build-docs.sh --publish /path/to/tika-site/publish
+
+set -euo pipefail
+cd "$(dirname "$0")"
+
+COMMIT=$(git rev-parse --short HEAD)
+DATE=$(date -u +%Y-%m-%d)
+
+# Inject commit into playbook, build, restore
+sed -i "/tika-stable-version/a\\ git-commit: '${COMMIT} (${DATE})'" antora-playbook.yml
+trap 'git checkout antora-playbook.yml' EXIT
+
+# Pass remaining args to Maven (filter out our --publish flag)
+PUBLISH_DIR=""
+MVN_ARGS=()
+while [[ $# -gt 0 ]]; do
+  case $1 in
+    --publish)
+      PUBLISH_DIR="$2"
+      shift 2
+      ;;
+    *)
+      MVN_ARGS+=("$1")
+      shift
+      ;;
+  esac
+done
+
+../mvnw antora:antora "${MVN_ARGS[@]}"
+
+echo "Site built at: target/site/"
+echo "Commit: ${COMMIT} (${DATE})"
+
+if [[ -n "${PUBLISH_DIR}" ]]; then
+  # Flatten: skip the 'tika/' component directory so URLs are /docs/4.0.0-SNAPSHOT/
+  # Copy UI assets one level above docs/ since HTML uses ../../_/ relative paths
+  DOCS_DIR="${PUBLISH_DIR}/docs"
+  mkdir -p "${DOCS_DIR}"
+  cp -r target/site/tika/* "${DOCS_DIR}/"
+  cp -r target/site/_/ "${PUBLISH_DIR}/_/"
+  # Fix the root redirect to match flattened layout
+  sed 's|tika/||g' target/site/index.html > "${DOCS_DIR}/index.html"
+  sed 's|/docs/tika/|/docs/|g' target/site/sitemap.xml > "${DOCS_DIR}/sitemap.xml"
+  cp target/site/404.html "${DOCS_DIR}/"
+  cp target/site/search-index.js "${DOCS_DIR}/"
+  echo "Published to: ${DOCS_DIR}/"
+fi
diff --git a/docs/modules/ROOT/pages/advanced/flores-eval-20260320.txt b/docs/modules/ROOT/attachments/advanced/flores-eval-20260320.txt
similarity index 100%
rename from docs/modules/ROOT/pages/advanced/flores-eval-20260320.txt
rename to docs/modules/ROOT/attachments/advanced/flores-eval-20260320.txt
diff --git a/docs/modules/ROOT/pages/advanced/generative-language-model.adoc b/docs/modules/ROOT/pages/advanced/generative-language-model.adoc
index 7c2b5fa0ce..8d1b0ebb59 100644
--- a/docs/modules/ROOT/pages/advanced/generative-language-model.adoc
+++ b/docs/modules/ROOT/pages/advanced/generative-language-model.adoc
@@ -221,8 +221,8 @@ Training performs two passes:
 2. **Calibration pass** — re-scores training sentences to compute per-language
    μ and σ (Welford's online algorithm), stored for z-score computation at runtime.
 
-The corpus can be in Wikipedia dump format (`corpusDir/{code}/sentences.txt`)
-or flat format (`corpusDir/{code}` with one sentence per line).
+The corpus can be in Wikipedia dump format (`corpusDir/\{code}/sentences.txt`)
+or flat format (`corpusDir/\{code}` with one sentence per line).
 Use `--max-per-lang N` (default 500,000) to cap sentences per language.
 
 == Evaluation Tools
diff --git a/docs/modules/ROOT/pages/advanced/integration-testing/tika-server.adoc b/docs/modules/ROOT/pages/advanced/integration-testing/tika-server.adoc
index 85bca5f1fa..b536701ebd 100644
--- a/docs/modules/ROOT/pages/advanced/integration-testing/tika-server.adoc
+++ b/docs/modules/ROOT/pages/advanced/integration-testing/tika-server.adoc
@@ -111,7 +111,7 @@ curl -s -X PUT -H "Accept: application/json" -T testPDF.pdf http://localhost:999
 
 *Expected:* JSON object with metadata only (no content).
 
-=== Test 8: PUT /meta/{field}
+=== Test 8: PUT /meta/\{field}
 
 [source,bash]
 ----
@@ -379,7 +379,7 @@ The following endpoints were tested and verified working:
 |`/tika/xml` |PUT |PASS
 |`/tika/json` |PUT |PASS
 |`/meta` |PUT |PASS
-|`/meta/{field}` |PUT |PASS
+|`/meta/\{field}` |PUT |PASS
 |`/rmeta` |PUT |PASS
 |`/rmeta/text` |PUT |PASS
 |`/language/stream` |PUT |PASS
diff --git a/docs/modules/ROOT/pages/advanced/language-detection-build.adoc b/docs/modules/ROOT/pages/advanced/language-detection-build.adoc
index 1fdd8b3a8a..1b999784bc 100644
--- a/docs/modules/ROOT/pages/advanced/language-detection-build.adoc
+++ b/docs/modules/ROOT/pages/advanced/language-detection-build.adoc
@@ -363,7 +363,7 @@ new file, and remove the old binary.
 Results on the https://github.com/facebookresearch/flores[FLORES-200] dev set
 (204 test languages, 997 sentences each). All scores are macro-averaged F1.
 
-Raw eval output: xref:advanced/flores-eval-20260320.txt[flores-eval-20260320.txt].
+Raw eval output: link:{attachmentsdir}/advanced/flores-eval-20260320.txt[flores-eval-20260320.txt].
 
 ==== Coverage-adjusted accuracy (each detector on its own supported languages)
 
diff --git a/docs/modules/ROOT/pages/advanced/language-detection.adoc b/docs/modules/ROOT/pages/advanced/language-detection.adoc
index 2ed1cd2d9a..9fdbe3c551 100644
--- a/docs/modules/ROOT/pages/advanced/language-detection.adoc
+++ b/docs/modules/ROOT/pages/advanced/language-detection.adoc
@@ -287,7 +287,7 @@ numeric check, keeping the language detection hot path fast.
 
 The language detector draws on several well-established techniques.
 
-[bibliography]
+[bibliography%unordered]
 - [[[cavnar1994]]] W. B. Cavnar and J. M. Trenkle,
   "N-Gram-Based Text Categorization,"
   in _Proceedings of the Third Annual Symposium on Document Analysis and
@@ -344,8 +344,7 @@ The language detector draws on several well-established techniques.
   Current models (v7+) use Wikipedia dumps as the primary corpus. +
   https://aclanthology.org/L12-1154/
 
-- [[[jacob2018]]] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard,
-  H. Adam, and D. Kalenichenko,
+- [[[jacob2018]]] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko,
   "Quantization and Training of Neural Networks for Efficient
   Integer-Arithmetic-Only Inference,"
   in _Proceedings of the IEEE Conference on Computer Vision and Pattern
diff --git a/docs/modules/ROOT/pages/developers/index.adoc b/docs/modules/ROOT/pages/developers/index.adoc
index 08e56a7065..e72c12747b 100644
--- a/docs/modules/ROOT/pages/developers/index.adoc
+++ b/docs/modules/ROOT/pages/developers/index.adoc
@@ -20,7 +20,7 @@ with custom parsers, detectors, and other components.
 
 == Topics
 
-* xref:serialization.adoc[Serialization and Configuration] - JSON configuration,
+* xref:developers/serialization.adoc[Serialization and Configuration] - JSON configuration,
   @TikaComponent annotation, and creating custom components
 
 == Coming Soon
diff --git a/docs/modules/ROOT/pages/index.adoc b/docs/modules/ROOT/pages/index.adoc
index ee46a9d07e..421d524ce0 100644
--- a/docs/modules/ROOT/pages/index.adoc
+++ b/docs/modules/ROOT/pages/index.adoc
@@ -41,3 +41,7 @@ xref:using-tika/index.adoc[Using Tika] to choose your integration method.
 
 Apache Tika is an Apache Software Foundation project, formerly a subproject of
 Apache Lucene.
+ifdef::git-commit[]
+[.small]#Built from commit: `{git-commit}`#
+endif::[]
+
diff --git a/docs/modules/ROOT/pages/maintainers/site.adoc b/docs/modules/ROOT/pages/maintainers/site.adoc
index 398c7611fc..2a86d9231d 100644
--- a/docs/modules/ROOT/pages/maintainers/site.adoc
+++ b/docs/modules/ROOT/pages/maintainers/site.adoc
@@ -41,6 +41,18 @@ mvn antora:antora
 
 The generated site will be at `docs/target/site/`.
 
+To stamp the build with the current commit hash (shown on the home page),
+add `git-commit` to the attributes in `antora-playbook.yml`:
+
+[source,yaml]
+----
+asciidoc:
+  attributes:
+    git-commit: 'abc1234'
+----
+
+Or pass it on the command line when you have a playbook that supports CLI attributes.
+
 === Previewing the Site
 
 **Option 1: Python HTTP server (recommended)**
@@ -99,7 +111,29 @@ Documentation versions are managed through Git branches with the `docs/` prefix.
 The playbook (`antora-playbook.yml`) is configured to build all `docs/*`
 branches automatically.
 
-=== Publishing a New Release
+=== Publishing to the Site
+
+Use `build-docs.sh` with the `--publish` flag to build and copy to the site SVN checkout:
+
+[source,bash]
+----
+cd docs
+./build-docs.sh --publish /path/to/tika-site/publish
+
+# Then in the SVN checkout:
+cd /path/to/tika-site
+svn add publish/docs publish/_ --force
+svn commit -m "Publish 4.0.0-SNAPSHOT docs"
+----
+
+This builds the Antora site, stamps the git commit on the home page, and copies
+the output to the site with the correct directory layout:
+
+* `publish/docs/4.0.0-SNAPSHOT/` -- the documentation pages
+* `publish/_/` -- CSS, JS, fonts (shared across versions)
+* `publish/docs/index.html` -- redirect to latest version
+
+=== Publishing a Release
 
 When releasing a new version (e.g., 4.0.0):
 
@@ -116,14 +150,13 @@ sed -i "s/4.0.0-SNAPSHOT/4.0.0/" docs/antora.yml
 git commit -am "Set docs version to 4.0.0"
 git push origin docs/4.0.0
 
-# 4. Build the site
+# 4. Build and publish
 cd docs
-mvn antora:antora
+./build-docs.sh --publish /path/to/tika-site/publish
 
-# 5. Publish to SVN
-cp -r target/site/* ~/tika-site/4.x/
-cd ~/tika-site
-svn add 4.x --force
+# 5. Commit to SVN
+cd /path/to/tika-site
+svn add publish/docs publish/_ --force
 svn commit -m "Publish 4.0.0 docs"
 ----
 
@@ -145,9 +178,8 @@ git push origin docs/4.0.0
 
 # 4. Rebuild and republish
 cd docs
-mvn antora:antora
-cp -r target/site/* ~/tika-site/4.x/
-cd ~/tika-site
+./build-docs.sh --publish /path/to/tika-site/publish
+cd /path/to/tika-site
 svn commit -m "Update 4.0.0 docs"
 ----
 
diff --git a/docs/modules/ROOT/pages/migration-to-4x/migrating-to-4x.adoc b/docs/modules/ROOT/pages/migration-to-4x/migrating-to-4x.adoc
index 5c963f4809..34ef91d778 100644
--- a/docs/modules/ROOT/pages/migration-to-4x/migrating-to-4x.adoc
+++ b/docs/modules/ROOT/pages/migration-to-4x/migrating-to-4x.adoc
@@ -120,10 +120,16 @@ WARNING: The configuration options for `PDFParser` and `TesseractOCRParser` have
 significantly in 4.x.
 The automatic converter will migrate your parameter names, but you should
 review the updated documentation to ensure your configuration is optimal.
 
-See:
+See the xref:configuration/index.adoc[Configuration] section for full details, including:
 
-* xref:configuration/parsers/pdf-parser.adoc[PDFParser Configuration] - Updated options for PDF parsing
-* xref:configuration/parsers/tesseract-ocr-parser.adoc[TesseractOCRParser Configuration] - Updated OCR options
+* xref:configuration/parsers/pdf-parser.adoc[PDFParser Configuration]
+* xref:configuration/parsers/tesseract-ocr-parser.adoc[TesseractOCRParser Configuration]
+* xref:configuration/parsers/tess4j-parser.adoc[Tess4J OCR (In-Process) Configuration]
+* xref:configuration/parsers/vlm-parsers.adoc[VLM Parsers (Claude, Gemini, OpenAI)]
+* xref:configuration/parsers/external-parser.adoc[External Parser (ffmpeg, exiftool, etc.)]
+
+For the general serialization model and how JSON configuration works, see
+xref:developers/serialization.adoc[Serialization and Configuration].
 
 === Full Configuration Example
@@ -147,8 +153,64 @@ a full table of changes and code migration examples.
 
 == API Changes
 
-// TODO: Document API changes
+=== TikaConfig replaced by TikaLoader
+
+`TikaConfig` has been removed. Use `TikaLoader` from `tika-serialization` instead.
+
+**3.x:**
+[source,java]
+----
+TikaConfig config = new TikaConfig(getClass().getClassLoader());
+Parser parser = config.getParser();
+Detector detector = config.getDetector();
+AutoDetectParser autoDetect = new AutoDetectParser(config);
+----
+
+**4.x:**
+[source,java]
+----
+// Default configuration (SPI-discovered components)
+TikaLoader loader = TikaLoader.loadDefault(getClass().getClassLoader());
+
+// Or from a JSON config file
+TikaLoader loader = TikaLoader.load(Path.of("tika-config.json"));
+
+// Access components
+Parser parser = loader.loadParsers();
+Detector detector = loader.loadDetectors();
+Parser autoDetect = loader.loadAutoDetectParser();
+ParseContext context = loader.loadParseContext();
+----
+
+NOTE: `TikaLoader` is in the `tika-serialization` module. Add `tika-serialization`
+as a dependency if you were previously only depending on `tika-core`.
+See xref:developers/serialization.adoc[Serialization and Configuration] for
+the full `TikaLoader` API.
+
+For simple use cases, the `Tika` facade and `DefaultParser` still work without
+`TikaLoader`:
+
+[source,java]
+----
+// Simple facade (unchanged from 3.x)
+Tika tika = new Tika();
+String text = tika.parseToString(file);
+
+// Direct parser use (unchanged from 3.x)
+Parser parser = new DefaultParser();
+----
+
+=== ExternalParser
+
+The legacy `ExternalParser` and `CompositeExternalParser` have been removed.
+External parsers must now be explicitly configured via JSON. See
+xref:configuration/parsers/external-parser.adoc[External Parser Configuration]
+for details.
 
 == Deprecations and Removals
 
-// TODO: Document deprecated and removed features
+* `TikaConfig` -- replaced by `TikaLoader`
+* `CompositeExternalParser` -- external parsers now require explicit JSON configuration
+* `ExternalParsersFactory` and XML-based external parser auto-discovery
+* DOM-based OOXML extractors (`XWPFWordExtractorDecorator`, `XSLFPowerPointExtractorDecorator`)
+  -- SAX-based extractors are now the only implementation
diff --git a/docs/modules/ROOT/pages/pipes/unpack-config.adoc b/docs/modules/ROOT/pages/pipes/unpack-config.adoc
index 45f15cdb1c..f3bc1fe5f4 100644
--- a/docs/modules/ROOT/pages/pipes/unpack-config.adoc
+++ b/docs/modules/ROOT/pages/pipes/unpack-config.adoc
@@ -175,7 +175,7 @@ This limits extraction to 100MB total.
 
 == Key Base Strategies
 
-`DEFAULT`:: Output key is `{containerKey}-{embeddedIdPrefix}{id}{suffix}`
+`DEFAULT`:: Output key is `\{containerKey}-\{embeddedIdPrefix}\{id}\{suffix}`
 `CUSTOM`:: Output key uses `emitKeyBase` as the prefix.
 
 == Safety Limits
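The `--publish` flattening step that `build-docs.sh` performs can be exercised in isolation. The sketch below is not the committed script: the scratch directory, the fake Antora output tree, and its file contents are invented stand-ins; only the `cp`/`sed` flattening mirrors what the script does.

```shell
#!/bin/sh
# Sketch of the --publish flattening in build-docs.sh, run against a
# stand-in site tree. All paths and file contents here are invented.
set -eu
WORK=$(mktemp -d)
cd "$WORK"

# Fake Antora output: versioned pages under tika/, shared UI assets under _/
mkdir -p target/site/tika/4.0.0-SNAPSHOT target/site/_
printf '%s\n' '<meta http-equiv="refresh" content="0; url=tika/4.0.0-SNAPSHOT/index.html">' > target/site/index.html
printf 'page\n' > target/site/tika/4.0.0-SNAPSHOT/index.html
printf 'css\n' > target/site/_/site.css

PUBLISH_DIR="$WORK/publish"
DOCS_DIR="$PUBLISH_DIR/docs"
mkdir -p "$DOCS_DIR" "$PUBLISH_DIR/_"

# Flatten: copy versioned pages without the tika/ segment; keep _/ one level up
cp -r target/site/tika/. "$DOCS_DIR/"
cp -r target/site/_/. "$PUBLISH_DIR/_/"
# Rewrite the root redirect so it no longer routes through tika/
sed 's|tika/||g' target/site/index.html > "$DOCS_DIR/index.html"

echo "published to $DOCS_DIR"
```

The point of the flatten is visible in the rewritten redirect: published URLs become `/docs/4.0.0-SNAPSHOT/` rather than `/docs/tika/4.0.0-SNAPSHOT/`, while the `_/` UI assets sit a level above `docs/` to satisfy the pages' `../../_/` relative links.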
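The commit-stamping `sed` in `build-docs.sh` can likewise be checked against a scratch playbook. A minimal sketch, assuming GNU sed: the playbook contents and the fixed `COMMIT`/`DATE` values are stand-ins for the real `antora-playbook.yml` and the `git rev-parse`/`date` output.

```shell
#!/bin/sh
# Sketch of the append-after-match sed used by build-docs.sh to stamp the
# build. PLAYBOOK contents and COMMIT/DATE values are invented stand-ins.
set -eu
PLAYBOOK=$(mktemp)
cat > "$PLAYBOOK" <<'EOF'
asciidoc:
  attributes:
    tika-stable-version: '3.2.0'
EOF

COMMIT=abc1234
DATE=2026-04-09
# GNU sed: append a git-commit attribute after the tika-stable-version line
sed -i "/tika-stable-version/a\\    git-commit: '${COMMIT} (${DATE})'" "$PLAYBOOK"
cat "$PLAYBOOK"
```

In the real script the edit is temporary: a `trap 'git checkout antora-playbook.yml' EXIT` restores the playbook after the Maven build, so the stamp only ever exists in the generated site.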
