Package: python3-lxml
Version: 4.6.3+dfsg-0.1
Severity: important
X-Debbugs-Cc: micha...@gmail.com

Dear Maintainer,

I ran into a bug that causes lxml to truncate its output when calling
"tostring" with encoding set to "utf8", whereas the output is complete
when encoding is set to "utf-8". See the attached "bug.py" file for an
example that reproduces it. The output under "Bad" has truncated text in
the last subfield.
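
For reference, Python's standard library treats the two spellings as
aliases of the same codec, so both calls would be expected to produce
identical bytes. A minimal stdlib-only sketch (it does not touch lxml,
and the sample string is only an illustration):

```python
import codecs

# Look up both spellings; each should resolve to the same canonical codec name.
canonical = {name: codecs.lookup(name).name for name in ("utf8", "utf-8")}
for name, resolved in canonical.items():
    print(name, "->", resolved)

# Illustrative sample only: encoding with either spelling gives the same bytes.
sample = "γγ → μ+μ−"
same_bytes = sample.encode("utf8") == sample.encode("utf-8")
print("identical bytes:", same_bytes)
```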

I previously reported this bug upstream at
https://bugs.launchpad.net/lxml/+bug/1944751, but further testing makes
me think that it is Debian-specific: when I run the attached "bug.py"
example in a fresh virtualenv in which I ran "pip install lxml", and
hence use the upstream binary wheel, the bug does not occur.
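
When comparing the two environments, it can help to confirm which copy
of lxml is actually being picked up. The following is a rough sketch
using only the standard library; the helper name "describe_origin" and
the path heuristics are assumptions for illustration (Debian's system
Python modules normally live under /usr/lib/python3/dist-packages, while
a virtualenv uses site-packages), not a definitive test:

```python
import importlib.util
import sys

# Assumption: directory used by Debian for system-wide Python 3 modules.
DEBIAN_SYSTEM_DIR = "/usr/lib/python3/dist-packages"


def describe_origin(module_path: str) -> str:
    """Guess where a module was installed from, based on its path (heuristic)."""
    if module_path.startswith(DEBIAN_SYSTEM_DIR + "/"):
        return "Debian system directory (likely the apt package)"
    if "/site-packages/" in module_path:
        return "site-packages (likely a pip install, e.g. in a virtualenv)"
    return "unknown"


# True when the interpreter is running inside a venv/virtualenv.
in_virtualenv = sys.prefix != sys.base_prefix
print("virtualenv:", in_virtualenv)

# Locate lxml without importing it; spec is None if it is not installed.
spec = importlib.util.find_spec("lxml")
if spec is None or spec.origin is None:
    print("lxml is not importable in this environment")
else:
    print("lxml found at:", spec.origin, "->", describe_origin(spec.origin))
```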

Best,
Micha

-- System Information:
Debian Release: 11.0
  APT prefers stable-updates
  APT policy: (500, 'stable-updates'), (500, 'stable-security'), (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 5.10.0-8-amd64 (SMP w/8 CPU threads)
Locale: LANG=en_GB.UTF-8, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8), LANGUAGE=en_GB:en
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages python3-lxml depends on:
ii  libc6       2.31-13
ii  libxml2     2.9.10+dfsg-6.7
ii  libxslt1.1  1.1.34-4
ii  python3     3.9.2-3

Versions of packages python3-lxml recommends:
ii  python3-bs4       4.9.3-1
ii  python3-html5lib  1.1-3

Versions of packages python3-lxml suggests:
pn  python-lxml-doc   <none>
pn  python3-lxml-dbg  <none>

-- no debconf information

# Attached file: bug.py
from lxml.builder import E
from lxml.etree import tostring

RECORD = E.record
CONTROLFIELD = E.controlfield
DATAFIELD = E.datafield
SUBFIELD = E.subfield

INPUT_DATA = {
    "520": [
        {
            "9": "APS",
            # One long string, written as adjacent literals that Python concatenates.
            "a": (
                'The first measurement of the dependence of <math '
                'display="inline"><mrow><mi>γ</mi><mi>γ</mi><mo '
                'stretchy="false">→</mo><msup><mrow><mi>μ</mi></mrow><mrow><mo>+</mo></mrow></msup><msup><mrow><mi>μ</mi></mrow><mrow><mo>−</mo></mrow></msup></mrow></math>'
                ' production on the multiplicity of neutrons emitted very close to the beam '
                'direction in ultraperipheral heavy ion collisions is reported. Data for '
                'lead-lead interactions at <math '
                'display="inline"><mrow><msqrt><mrow><msub><mrow><mi>s</mi></mrow><mrow><mi>N</mi><mi>N</mi></mrow></msub></mrow></msqrt><mo>=</mo><mn>5.02</mn><mtext>\u2009</mtext><mtext>\u2009</mtext><mi>TeV</mi></mrow></math>,'
                ' with an integrated luminosity of approximately <math '
                'display="inline"><mrow><mn>1.5</mn><mtext>\u2009</mtext><mtext>\u2009</mtext><msup><mrow><mi>nb</mi></mrow><mrow><mo>-</mo><mn>1</mn></mrow></msup></mrow></math>,'
                ' are collected using the CMS detector at the LHC. The azimuthal correlations '
                'between the two muons in the invariant mass region <math '
                'display="inline"><mrow><mn>8</mn><mo>&lt;</mo><msub><mrow><mi>m</mi></mrow><mrow><mi>μ</mi><mi>μ</mi></mrow></msub><mo>&lt;</mo><mn>60</mn><mtext>\u2009</mtext><mtext>\u2009</mtext><mi>GeV</mi></mrow></math>'
                ' are extracted for events including 0, 1, or at least 2 neutrons detected in '
                'the forward pseudorapidity range <math display="inline"><mrow><mrow><mo '
                'stretchy="false">|</mo><mi>η</mi><mo '
                'stretchy="false">|</mo></mrow><mo>&gt;</mo><mn>8.3</mn></mrow></math>. The '
                'back-to-back correlation structure from leading-order photon-photon scattering '
                'is found to be significantly broader for events with a larger number of emitted '
                'neutrons from each nucleus, corresponding to interactions with a smaller impact '
                'parameter. This observation provides a data-driven demonstration that the '
                'average transverse momentum of photons emitted from relativistic heavy ions has '
                'an impact parameter dependence. These results provide new constraints on models '
                'of photon-induced interactions in ultraperipheral collisions. They also provide '
                'a baseline to search for possible final-state effects on lepton pairs caused by '
                'traversing a quark-gluon plasma produced in hadronic heavy ion collisions.'
            ),
        },
        {
            "9": "arXiv",
            # One long string, written as adjacent literals that Python concatenates.
            "a": (
                "The first measurement of the dependence of "
                "$\\gamma\\gamma$$\\to$$\\mu^{+}\\mu^{-}$ production on the multiplicity of "
                "neutrons emitted very close to the beam direction in ultraperipheral heavy ion "
                "collisions is reported. Data for lead-lead interactions at "
                "$\\sqrt{s_\\mathrm{NN}} =$ 5.02 TeV, with an integrated luminosity of "
                "approximately 1.5 nb$^{-1}$, were collected using the CMS detector at the LHC. "
                "The azimuthal correlations between the two muons in the invariant mass region 8 "
                "$\\lt$$m_{\\mu\\mu}$$\\lt$ 60 GeV are extracted for events including 0, 1, or "
                "at least 2 neutrons detected in the forward pseudorapidity range "
                "$|\\eta|$$\\gt$ 8.3. The back-to-back correlation structure from leading-order "
                "photon-photon scattering is found to be significantly broader for events with a "
                "larger number of emitted neutrons from each nucleus, corresponding to "
                "interactions with a smaller impact parameter. This observation provides a "
                "data-driven demonstration that the average transverse momentum of photons "
                "emitted from relativistic heavy ions has an impact parameter dependence. These "
                "results provide new constraints on models of photon-induced interactions in "
                "ultraperipheral collisions. They also provide a baseline to search for possible "
                "final-state effects on lepton pairs caused by traversing a quark-gluon plasma "
                "produced in hadronic heavy ion collisions."
            ),
        },
    ]
}

record = RECORD()
for tag, values in sorted(INPUT_DATA.items()):
    for value in values:
        datafield = DATAFIELD({"tag": tag, "ind1": " ", "ind2": " "})
        for code, el in sorted(value.items()):
            datafield.append(
                SUBFIELD(el, {"code": code})
            )
        record.append(datafield)

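# With the affected Debian package, the "utf8" spelling yields truncated output,
# while the "utf-8" spelling serializes the full record.
utf8_bad = tostring(record, encoding="utf8")
utf8_good = tostring(record, encoding="utf-8")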

print("Bad:", utf8_bad, "\n", "Good:", utf8_good, sep="\n")
