[Tutor] xpath - html entities issue -- &

bruce Tue, 04 Oct 2016 07:05:32 -0700

Hi.

I just realized I might have a problem while testing a crawl.


I fetch a page of data with a basic curl call. The returned data is
HTML, charset UTF-8.

I did a quick replace('&amp;','&') on it, which replaced each '&amp;' as
desired, so the content then had only '&' in it.

I then ran parseString/xpath to extract what I wanted, and found that each
'&' comes back as '&amp;' in the returned xpath content.

My question: is there a way to return only the actual character, not
the HTML entity (&amp;)?

I can provide a more comprehensive chunk of code, but I minimized the post
to get to the heart of the issue. Also, I'd prefer not to use a separate
parsing library.

------------------------------------
code chunk

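import libxml2dom

# `a` (the fetched page) and `tpath` (the xpath expression) are set
# earlier in the full script and are omitted here.
q1 = libxml2dom

# Parse the page as HTML, then run the xpath query.
s2 = q1.parseString(a.toString().strip(), html=1)
tt = s2.xpath(tpath)

# Serialize the first matching node back to markup and print it.
tt = tt[0].toString().strip()
print "tit " + tt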

-------------------------------------


the content of a.toString() (shortened)
.
.
.
                 <div class="material-group-overview">
                    <div class="icon-book"></div>
                    <h3 class="material-group-title">Organization
Development & Change
                        <span>Edition: 10th</span>
                    </h3>
                    <a class="material-group-toggle-top-link"
id="toggle-top_1" href="javascript:void(0);" title="Click to hide options
for material">

.
.
.

the xpath results are

                <div class="material-group-overview">
                    <div class="icon-book"></div>
                    <h3 class="material-group-title">Organization
Development &amp; Change
                        <span>Edition: 10th</span>
                    </h3>


As you can see, in the xpath results (from toString()) each '&' has
become '&amp;'.
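
To narrow down where the '&amp;' comes from, here is a minimal sketch of
the same effect. It uses the standard library's xml.dom.minidom purely as
a stand-in (it is not libxml2dom, and I'm only assuming the two behave
alike here), and SNIPPET is a made-up, well-formed version of the <h3>
above. Serializing a node writes '&' back out as '&amp;', while the text
node's own value holds the literal character.

```python
from xml.dom import minidom

# Hypothetical, well-formed stand-in for the <h3> fragment above.
# An XML parser needs "&amp;" in the source to accept the ampersand.
SNIPPET = ('<h3 class="material-group-title">'
           'Organization Development &amp; Change</h3>')

doc = minidom.parseString(SNIPPET)
h3 = doc.documentElement

# Serializing the element produces markup, so "&" is escaped again.
serialized = h3.toxml()

# The text node's value is the decoded character data.
text_value = h3.firstChild.data

print(serialized)
print(text_value)
```

If that holds for libxml2dom too, the escaping happens in the
serialization step rather than in the xpath query itself.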

I'm wondering whether there's an option that can be used within toString(),
or whether you really have to wrap each xpath/toString() call in an
unescape() kind of step to convert HTML entities to the corresponding
characters.

Thanks
_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
