Xmpbox metadata parsing issue

Sylvere Babin Tue, 11 Jul 2023 03:24:07 -0700

Hello,

We use PDFBox to read the XMP metadata of PDF documents in the Factur-X 
standard, a Franco-German e-invoicing standard.
The XML schema corresponding to this metadata is quite simple, and retrieving 
the values works perfectly with the 
org.apache.xmpbox.XMPMetadata.getSchema(String) method.
By default, the prefix is fx:


<rdf:Description 
xmlns:fx="urn:factur-x:pdfa:CrossIndustryDocument:invoice:1p0#" rdf:about="">
      <fx:DocumentType>INVOICE</fx:DocumentType>
      <fx:DocumentFileName>factur-x.xml</fx:DocumentFileName>
      <fx:Version>1.0</fx:Version>
      <fx:ConformanceLevel>BASIC</fx:ConformanceLevel>
</rdf:Description>

In one case, there was a document with two schemas sharing the same namespace 
URI but using different prefixes (fx and zf).
I tried the org.apache.xmpbox.XMPMetadata.getSchema(String, String) method, 
which, according to the documentation, seems to handle this case by filtering 
on the prefix.
This method threw a NullPointerException (line 268), because the prefix of 
the Factur-X schema in the org.apache.xmpbox.XMPMetadata.schemas collection was 
null.
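
As a cross-check that does not depend on xmpbox, the prefixes bound to the 
Factur-X namespace can be read straight from the XMP packet with the JDK's own 
DOM parser. This is only a sketch: the class name FacturXPrefixScan, the 
findPrefixes helper and the SAMPLE packet are made up for illustration, and the 
sample is a reduced stand-in for the real document, not its actual metadata.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class FacturXPrefixScan {

    static final String FX_NS = "urn:factur-x:pdfa:CrossIndustryDocument:invoice:1p0#";

    // Hypothetical sample: two descriptions, same namespace URI, prefixes fx and zf.
    static final String SAMPLE =
        "<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">"
      + "<rdf:Description xmlns:fx=\"" + FX_NS + "\" rdf:about=\"\">"
      + "<fx:ConformanceLevel>BASIC</fx:ConformanceLevel>"
      + "</rdf:Description>"
      + "<rdf:Description xmlns:zf=\"" + FX_NS + "\" rdf:about=\"\">"
      + "<zf:ConformanceLevel>BASIC</zf:ConformanceLevel>"
      + "</rdf:Description>"
      + "</rdf:RDF>";

    // Collects, in document order, the prefixes of all elements in the given namespace.
    static Set<String> findPrefixes(String xml, String namespaceUri) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for namespace URI and prefix lookups
        Document doc = factory.newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        Set<String> prefixes = new LinkedHashSet<>();
        NodeList nodes = doc.getElementsByTagNameNS(namespaceUri, "*");
        for (int i = 0; i < nodes.getLength(); i++) {
            // getPrefix() is null for elements in a default (unprefixed) namespace
            prefixes.add(((Element) nodes.item(i)).getPrefix());
        }
        return prefixes;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(findPrefixes(SAMPLE, FX_NS)); // prints [fx, zf]
    }
}
```

If the real packet yields fx and zf here while xmpbox reports null, that would 
point at the parser rather than at the document.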

So, I've run tests with a hundred example files provided by the Factur-X 
consortium, and it seems that for any file, the schema with the Factur-X URI 
always gets a null prefix, regardless of whether one or more schemas exist with 
this namespace.

This raises two points:

  1.  If the prefix can be null, the getSchema(String, String) method should 
handle it.
  2.  Does the Factur-X metadata conform to the XMP standard, or is 
there a bug in the prefix parsing?
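
To illustrate point 1, here is a minimal sketch of a null-tolerant lookup. It 
assumes the failure comes from calling equals() on a null stored prefix; I have 
not confirmed that this is what line 268 does. SchemaEntry and find are 
hypothetical stand-ins written for this example, not xmpbox classes or methods.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class NullSafeSchemaLookup {

    // Hypothetical stand-in for a parsed schema; NOT the xmpbox XMPSchema class.
    static final class SchemaEntry {
        final String prefix;    // may be null, as observed with the Factur-X files
        final String namespace;

        SchemaEntry(String prefix, String namespace) {
            this.prefix = prefix;
            this.namespace = namespace;
        }
    }

    // Returns the first entry matching both prefix and namespace, or null if none.
    // Objects.equals tolerates null on either side, so a null stored prefix
    // simply fails to match instead of throwing a NullPointerException.
    static SchemaEntry find(List<SchemaEntry> schemas, String prefix, String namespace) {
        for (SchemaEntry s : schemas) {
            if (Objects.equals(prefix, s.prefix) && Objects.equals(namespace, s.namespace)) {
                return s;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String ns = "urn:factur-x:pdfa:CrossIndustryDocument:invoice:1p0#";
        List<SchemaEntry> schemas = Arrays.asList(
                new SchemaEntry(null, ns),   // schema whose prefix was not captured
                new SchemaEntry("zf", ns));

        System.out.println(find(schemas, "fx", ns) == null); // prints true
        System.out.println(find(schemas, "zf", ns).prefix);  // prints zf
    }
}
```

With this kind of comparison the caller gets null back and can fall back to the 
lookup by namespace URI alone, which is what my code below already does.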

Here's the PDF document: pdfExemple.pdf
<https://cegidgroup-my.sharepoint.com/:b:/g/personal/sbabin_cegid_com/EVN8vpGbR1pEvaOuoIjyvfQBuhV1ZWFlYfAIKMfuAhd6Aw?e=cahEv2>
Here's the code I use to retrieve the Factur-X metadata values:

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.common.PDMetadata;
import org.apache.xmpbox.XMPMetadata;
import org.apache.xmpbox.schema.XMPSchema;
import org.apache.xmpbox.xml.DomXmpParser;
import org.apache.xmpbox.xml.XmpParsingException;

public class FacturX {

       private static final String FX_NS =
                    "urn:factur-x:pdfa:CrossIndustryDocument:invoice:1p0#";

       public static void main(String[] args) {
             // try-with-resources closes the document even if parsing fails
             try (PDDocument doc = PDDocument.load(new File(args[0]))) {
                    PDDocumentCatalog catalog = doc.getDocumentCatalog();
                    PDMetadata m = catalog.getMetadata();
                    if (m == null) {
                           System.out.println("This PDF document has no XMP metadata");
                           return;
                    }

                    XMPMetadata metadata;
                    try (InputStream xmlInputStream = m.createInputStream()) {
                           DomXmpParser p = new DomXmpParser();
                           p.setStrictParsing(false);
                           metadata = p.parse(xmlInputStream);
                    }

                    // Getting the factur-x schema with the default "fx" prefix
                    // (case of two factur-x schemas with different prefixes).
                    // This is the call that throws the NullPointerException.
                    XMPSchema fx = metadata.getSchema("fx", FX_NS);

                    // If there is no schema with the fx prefix, search for the
                    // schema with the namespace URI only
                    if (fx == null) {
                           fx = metadata.getSchema(FX_NS);
                    }

                    if (fx == null) {
                           System.out.println("This PDF document is not a valid Factur-X file");
                    } else {
                           String conformanceLevel =
                                  fx.getUnqualifiedTextPropertyValue("ConformanceLevel");
                           String documentType =
                                  fx.getUnqualifiedTextPropertyValue("DocumentType");
                           String version =
                                  fx.getUnqualifiedTextPropertyValue("Version");
                           String documentFileName =
                                  fx.getUnqualifiedTextPropertyValue("DocumentFileName");
                    }
             } catch (XmpParsingException | IOException e) {
                    e.printStackTrace();
             }
       }
}

Thanks for your help,

Sylvère Babin
Developer



