Hi. I just realized I might have a problem with testing a crawl.
I get a page of data via a basic curl. The returned data is HTML, charset UTF-8. I did a quick replace('&amp;', '&') and it replaced the '&amp;' as desired, so the content only had '&' in it.

I then did a parseString/xpath to extract what I wanted, and realized I have '&amp;' representing the '&' in the returned xpath content.

My question: is there a way to return only the actual character, not the HTML entity (&amp;)?

I can provide a more comprehensive chunk of code, but I minimized the post to get to the heart of the issue. Also, I'd prefer not to use a separate parsing library.

------------------------------------
code chunk

import libxml2dom
q1 = libxml2dom
# a holds the fetched page; tpath is the xpath expression
s2 = q1.parseString(a.toString().strip(), html=1)
tt = s2.xpath(tpath)
tt = tt[0].toString().strip()
print("tit " + tt)
-------------------------------------

the content of a.toString() (shortened)

. . .
<div class="material-group-overview">
<div class="icon-book"></div>
<h3 class="material-group-title">Organization Development & Change <span>Edition: 10th</span> </h3>
<a class="material-group-toggle-top-link" id="toggle-top_1" href="javascript:void(0);" title="Click to hide options for material">
. . .

the xpath results are

<div class="material-group-overview">
<div class="icon-book"></div>
<h3 class="material-group-title">Organization Development &amp; Change <span>Edition: 10th</span> </h3>

As you can see, in the results of the xpath (toString()) the & --> &amp;

I'm wondering if there's a process that can be used within the toString(), or do you really have to wrap each xpath/toString with an unescape() kind of process to convert HTML entities to the requisite characters?

Thanks

_______________________________________________
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor
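
For reference, the "unescape() kind of process" mentioned above could look roughly like the sketch below. It assumes Python 3 and uses only the standard-library html.unescape (on Python 2 the equivalent call would be HTMLParser.HTMLParser().unescape). The helper name unescape_serialized and the sample string are made up for illustration, and libxml2dom is left out so the snippet runs on its own.

```python
import html


def unescape_serialized(markup):
    """Turn HTML entities such as &amp; back into the characters they stand for.

    Hypothetical helper: it would wrap the string returned by a node's
    toString() call; it does not change how the node itself is serialized.
    """
    return html.unescape(markup)


# Sample string standing in for tt[0].toString().strip() from the code chunk.
serialized = (
    '<h3 class="material-group-title">Organization Development &amp; Change '
    '<span>Edition: 10th</span> </h3>'
)

title = unescape_serialized(serialized)
print("tit " + title)
```

Note that this decodes every entity in the string, so the result is no longer guaranteed to be well-formed markup; it is only meant for text that is displayed or stored, not parsed again.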