Re: Python scripts in .exe form

2022-08-20 Thread Jim Schwartz
What method did you use to create the exe file from your Python scripts?  If 
it was PyInstaller, then it puts the compiled versions of these Python scripts 
in a Windows temp folder when you run them. You'll be able to get the scripts 
from there. 
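
One way to spot that folder while your program is running - a rough sketch,
assuming a one-file PyInstaller build (which names the folder _MEIxxxxxx):

    import glob
    import os
    import tempfile

    # While a one-file PyInstaller exe is running, it unpacks its support
    # files into a %TEMP%\_MEIxxxxxx folder - list any such folders.
    pattern = os.path.join(tempfile.gettempdir(), "_MEI*")
    for folder in glob.glob(pattern):
        print(folder)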

Sent from my iPhone

> On Aug 19, 2022, at 9:51 PM, Mona Lee  wrote:
> 
> I'm pretty new to Python, and I had to do some tinkering because I was 
> running into issues with trying to download a package from PIP and must've 
> caused some issues in my program that I don't know how to fix.
> 
> 1. It started when I was unable to update PIP to the newest version because 
> of some "Unknown error" (VS Code error - unable to read file - 
> Unknown (FileSystemError)), where I believe some file was not saved in the 
> right location? 
> 
> 2. In my command line on VS code there used to be the prefix that looked 
> something like "PS C:\Users\[name]>" but now it is "PS 
> C:\Users\[name]\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\Scripts>
> 
> From there I redownloaded VS Code but still have issue 2).
> 
> Also, my scripts are now in .exe form, which I cannot access because "it is 
> either binary or in an unsupported text encoding". I've tried to extract 
> them back into .py form using pyinstxtractor and decompile-python3 but I 
> can't get these to work.
> 
> 3. also wanted to mention that some of my old Python programs are missing.
> -- 
> https://mail.python.org/mailman/listinfo/python-list

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Jon Ribbens via Python-list
On 2022-08-19, Chris Angelico  wrote:
> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?
>
> Using the Alice example from the BS4 docs:
>
> >>> html_doc = """<html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
> <a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
> <a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
>
> <p class="story">...</p>
> """
> >>> print(soup)
> <html><head><title>The Dormouse's story</title></head>
> <body>
> <p class="title"><b>The Dormouse's story</b></p>
> <p class="story">Once upon a time there were three little sisters; and
> their names were
> <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
> <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
> <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
> and they lived at the bottom of a well.</p>
> <p class="story">...</p>
> </body></html>

>
> Note two distinct changes: firstly, whitespace has been removed, and
> secondly, attributes are reordered (I think alphabetically). There are
> other canonicalizations being done, too.
>
> I'm trying to make some automated changes to a huge number of HTML
> files, with minimal diffs so they're easy to validate. That means that
> spurious changes like these are very much unwanted. Is there a way to
> get BS4 to reconstruct the original precisely?
>
> The mutation itself would be things like finding an anchor tag and
> changing its href attribute. Fairly simple changes, but might alter
> the length of the file (eg changing "http://example.com/" into
> "https://example.com/"). I'd like to do them intelligently rather than
> falling back on element.sourceline and element.sourcepos, but worst
> case, that's what I'll have to do (which would be fiddly).

I'm tempting the Wrath of Zalgo by saying it, but ... regexp?
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Chris Angelico
On Sun, 21 Aug 2022 at 03:27, Stefan Ram  wrote:
>
> [email protected] writes:
> >textual representations.  That way, the following two elements are the
> >same (and similar with a collection of sub-elements in a different order
> >in another document):
>
>   The /elements/ differ. They have the /same/ infoset.

That's the bit that's hard to prove.

>   The OP could edit the files with regexps to create a new version.

To you and Jon, who also suggested this: how would that be beneficial?
With Beautiful Soup, I have the line number and position within the
line where the tag starts; what does a regex give me that I don't have
that way?

>   Soup := BeautifulSoup.
>
>   Then have Soup read both the new version and the old version.
>
>   Then have Soup also edit the old version read in, the same way as
>   the regexps did and verify that now the old version edited by
>   Soup and the new version created using regexps agree.
>
>   Or just use Soup as a tool to show the diffs for visual inspection
>   by having Soup read both the original version and the version edited
>   with regexps. Now both are normalized by Soup and Soup can show the
>   diffs (such a diff feature might not be a part of Soup, but it should
>   not be too much effort to write one using Soup).
>

But as mentioned, the entire problem *is* the normalization, as I have
no proof that it has had no impact on the rendering of the page.
Comparing two normalized versions is no better than my original option
1, whereby I simply ignore the normalization and write out the
reconstructed content.

It's easy if you know for certain that the page is well-formed. Much
harder if you do not - or, as in some cases, if you know the page is
badly-formed.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Python scripts in .exe form

2022-08-20 Thread Barry


> On 20 Aug 2022, at 14:28, Jim Schwartz  wrote:
> 
> What method did you use to create the exe file from your Python scripts?  If 
> it was PyInstaller, then it puts the compiled versions of these Python 
> scripts in a Windows temp folder when you run them. You'll be able to get the 
> scripts from there. 

The temp folder is only used for .dll files; the Python code is in a data 
block that is appended to the .exe stub.
There are tools that can grab the appended data and dump it out.
Or at least there should be.
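
For example, something along these lines (untested sketch; "myscript.exe" is
a placeholder name, and decompyle3 only supports some Python versions):

    py pyinstxtractor.py myscript.exe
    py -m pip install decompyle3
    decompyle3 myscript.exe_extracted\myscript.pyc > myscript.py

The first step should write the embedded .pyc files into a
myscript.exe_extracted folder; the decompiler then turns a .pyc back into
source.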

Barry

> 
> Sent from my iPhone
> 
>> On Aug 19, 2022, at 9:51 PM, Mona Lee  wrote:
>> 
>> I'm pretty new to Python, and I had to do some tinkering because I was 
>> running into issues with trying to download a package from PIP and must've 
>> caused some issues in my program that I don't know how to fix.
>> 
>> 1. It started when I was unable to update PIP to the newest version because 
>> of some "Unknown error" (VS Code error - unable to read file - 
>> Unknown (FileSystemError)), where I believe some file was not saved in the 
>> right location? 
>> 
>> 2. In my command line on VS code there used to be the prefix that looked 
>> something like "PS C:\Users\[name]>" but now it is "PS 
>> C:\Users\[name]\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.10_qbz5n2kfra8p0\LocalCache\local-packages\Python310\Scripts>
>> 
>> From there I redownloaded VS Code but still have issue 2).
>> 
>> Also, my scripts are now in .exe form, which I cannot access because "it 
>> is either binary or in an unsupported text encoding". I've tried to extract 
>> them back into .py form using pyinstxtractor and decompile-python3 but I 
>> can't get these to work.
>> 
>> 3. also wanted to mention that some of my old Python programs are missing.
>> -- 
>> https://mail.python.org/mailman/listinfo/python-list
> 
> -- 
> https://mail.python.org/mailman/listinfo/python-list

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Jon Ribbens via Python-list
On 2022-08-20, Chris Angelico  wrote:
> On Sun, 21 Aug 2022 at 03:27, Stefan Ram  wrote:
>> [email protected] writes:
>> >textual representations.  That way, the following two elements are the
>> >same (and similar with a collection of sub-elements in a different order
>> >in another document):
>>
>>   The /elements/ differ. They have the /same/ infoset.
>
> That's the bit that's hard to prove.
>
>>   The OP could edit the files with regexps to create a new version.
>
> To you and Jon, who also suggested this: how would that be beneficial?
> With Beautiful Soup, I have the line number and position within the
> line where the tag starts; what does a regex give me that I don't have
> that way?

You mean you could use BeautifulSoup to read the file and identify the
bits you want to change by line number and offset, and then you could
use that data to try and update the file, hoping like hell that your
definition of "line" and "offset" are identical to BeautifulSoup's
and that you don't mess up later changes when you do earlier ones (you
could do them in reverse order of line and offset I suppose) and
probably resorting to regexps anyway in order to find the part of the
tag you want to change ...

... or you could avoid all that faff and just do re.sub()?
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Jon Ribbens via Python-list
On 2022-08-20, Stefan Ram  wrote:
> Jon Ribbens  writes:
>>... or you could avoid all that faff and just do re.sub()?
>
> import bs4
> import re
>
> source = '<a href="http"></a>'
>
> # Use Python to change the source, keeping the order of attributes.
>
> result = re.sub( r'href\s*=\s*"http"', r'href="https"', source )
> result = re.sub( r"href\s*=\s*'http'", r"href='https'", result )

You could go a bit harder with the regexp of course, e.g.:

  result = re.sub(
      r"""(<\s*a\s+[^>]*href\s*=\s*)(['"])\s*OLD\s*\2""",
      r"\1\2NEW\2",
      source,
      flags=re.IGNORECASE,
  )

> # Now use BeautifulSoup only for the verification of the result.
>
> reference = bs4.BeautifulSoup( source, features="html.parser" )
> for a in reference.find_all( "a" ):
>     if a[ 'href' ]== 'http': a[ 'href' ]='https'
>
> print( bs4.BeautifulSoup( result, features="html.parser" )== reference )

Hmm, yes that seems like a pretty good idea.
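
Something like this would do for the diff feature Stefan mentions (a rough
sketch, with made-up filenames):

    import difflib

    import bs4

    def normalised(html):
        # Parse and re-serialise so both sides get the same canonicalisation
        return bs4.BeautifulSoup(html, "html.parser").prettify().splitlines()

    with open("page.orig.html", encoding="utf-8") as f:
        before = f.read()
    with open("page.html", encoding="utf-8") as f:
        after = f.read()

    for line in difflib.unified_diff(normalised(before), normalised(after),
                                     "original", "edited", lineterm=""):
        print(line)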
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread dn
On 20/08/2022 12.38, Chris Angelico wrote:
> On Sat, 20 Aug 2022 at 10:19, dn  wrote:
>> On 20/08/2022 09.01, Chris Angelico wrote:
>>> On Sat, 20 Aug 2022 at 05:12, Barry  wrote:
> On 19 Aug 2022, at 19:33, Chris Angelico  wrote:
>
> What's the best way to precisely reconstruct an HTML file after
> parsing it with BeautifulSoup?
...

>>> well. Thanks for trying, anyhow.
>>>
>>> So I'm left with a few options:
>>>
>>> 1) Give up on validation, give up on verification, and just run this
>>> thing on the production site with my fingers crossed
>>> 2) Instead of doing an intelligent reconstruction, just str.replace()
>>> one URL with another within the file
>>> 3) Split the file into lines, find the Nth line (elem.sourceline) and
>>> str.replace that line only
>>> 4) Attempt to use elem.sourceline and elem.sourcepos to find the start
>>> of the tag, manually find the end, and replace one tag with the
>>> reconstructed form.
>>>
>>> I'm inclined to the first option, honestly. The others just seem like
>>> hard work, and I became a programmer so I could be lazy...
>> +1 - but I've noticed that sometimes I have to work quite hard to be
>> this lazy!
> 
> Yeah, that's very true...
> 
>> Am assuming that http -> https is not the only 'change' (if it were,
>> you'd just do that without BS). How many such changes are planned/need
>> checking? Care to list them?

This project has many of the same 'smells' as a database-harmonisation
effort. Particularly one where 'the previous guy' used to use field-X
for certain data, but his replacement decided that field-Y 'sounded
better' (or some such user-logic). Arrrggg!

If you like head-aches, and users coming to you with ifs-buts-and-maybes
AFTER you've 'done stuff', this is your sort of project!


> Assumption is correct. The changes are more of the form "find all the
> problems, add to the list of fixes, try to minimize the ones that need
> to be done manually". So far, what I have is:

Having taken the trouble to identify this list of improvements and given
the determination to verify each, consider working through one item at a
time, rather than in a single pass. This will enable individual logging
of changes, a manual check of each alteration, and the ability to
choose/tailor the best tool for that specific task.

In fact, depending upon frequency, consider making the changes manually (and
with improved confidence in the result).

The presence of (or allusion to) the word "some" in these list-items is
'the killer'. Automation doesn't like 'some' (cf "all") unless the
criteria can be clearly and unambiguously defined. Ouch!

(I don't think you need to be told any of this, but hey: dreams are free!)


> 1) A bunch of http -> https, but not all of them - only domains where
> I've confirmed that it's valid

The search-criteria is the list of valid domains, rather than the
"http/https" which is likely the first focus.


> 2) Some absolute to relative conversions:
> https://www.gsarchive.net/whowaswho/index.htm should be referred to as
> /whowaswho/index.htm instead

Similarly, if you have a list of these.


> 3) A few outdated URLs for which we know the replacement, eg
> http://www.cris.com/~oakapple/gasdisc/ to
> http://www.gasdisc.oakapplepress.com/ (this one can't go on
> HTTPS, which is one reason I can't shortcut that)

Again.


> 4) Some internal broken links where the path is wrong - anything that
> resolves to /books/ but can't be found might be better
> rewritten as /html/perf_grps/websites/ if the file can be
> found there

Again.


> 5) Any external link that yields a permanent redirect should, to save
> clientside requests, get replaced by the destination. We have some
> Creative Commons badges that have moved to new URLs.

Do you have these as a list, or are you intending the automated-method
to auto-magically follow the link to determine any need for action?
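
If the latter, perhaps something along these lines (a rough sketch, assuming
the requests library; the helper-name is made-up):

    import requests

    def permanent_destination(url):
        # Return the final URL only if every hop in the redirect chain was
        # permanent (301/308) - otherwise None, ie leave the link alone.
        try:
            resp = requests.get(url, allow_redirects=True, timeout=10)
        except requests.RequestException:
            return None
        if resp.history and all(r.status_code in (301, 308)
                                for r in resp.history):
            return resp.url
        return None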


> And there'll be other fixes to be done too. So it's a bit complicated,
> and no simple solution is really sufficient. At the very very least, I
> *need* to properly parse with BS4; the only question is whether I
> reconstruct from the parse tree, or go back to the raw file and try to
> edit it there.

At least the diffs would give you something to work-from, but it's a bit
like git-diffs claiming a 'change' when the only difference is that my
IDE strips blanks from the ends of code-lines, or some-such silliness.

Which brings me to ask: why "*need* to properly parse with BS4"?

What about selective use of tools, previously-mentioned in this thread?

Is Selenium worthy of consideration?

I'm assuming you've already been using a link-checker utility to locate
the links which need to be changed. They can be used in QA-mode
after-the-fact too.


> For the record, I have very long-term plans to migrate parts of the
> site to Markdown, which would make a lot of things easier. But for
> now, I need to fix the existing problems in the existing HTML files,
> without doing gigantic wholesale layout changes.

Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Chris Angelico
On Sun, 21 Aug 2022 at 09:31, Jon Ribbens via Python-list
 wrote:
>
> On 2022-08-20, Chris Angelico  wrote:
> > On Sun, 21 Aug 2022 at 03:27, Stefan Ram  wrote:
> >> [email protected] writes:
> >> >textual representations.  That way, the following two elements are the
> >> >same (and similar with a collection of sub-elements in a different order
> >> >in another document):
> >>
> >>   The /elements/ differ. They have the /same/ infoset.
> >
> > That's the bit that's hard to prove.
> >
> >>   The OP could edit the files with regexps to create a new version.
> >
> > To you and Jon, who also suggested this: how would that be beneficial?
> > With Beautiful Soup, I have the line number and position within the
> > line where the tag starts; what does a regex give me that I don't have
> > that way?
>
> You mean you could use BeautifulSoup to read the file and identify the
> bits you want to change by line number and offset, and then you could
> use that data to try and update the file, hoping like hell that your
> definition of "line" and "offset" are identical to BeautifulSoup's
> and that you don't mess up later changes when you do earlier ones (you
> could do them in reverse order of line and offset I suppose) and
> probably resorting to regexps anyway in order to find the part of the
> tag you want to change ...
>
> ... or you could avoid all that faff and just do re.sub()?

Stefan answered in part, but I'll add that it is far FAR easier to do
the analysis with BS4 than regular expressions. I'm not sure what
"hoping like hell" is supposed to mean here, since the line and offset
have been 100% accurate in my experience; the only part I'm unsure
about is where the _end_ of the tag is (and maybe there's a way I can
use BS4 again to get that??).
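
For illustration, the fallback I keep alluding to looks roughly like this (a
sketch only; assumes html.parser, where sourceline is 1-based and sourcepos
is a 0-based column, and "page.html" is a placeholder):

    import bs4

    with open("page.html", encoding="utf-8") as f:
        raw = f.read()
    lines = raw.splitlines(keepends=True)
    soup = bs4.BeautifulSoup(raw, "html.parser")

    for a in soup.find_all("a", href=True):
        # Absolute offset of the start of the tag in the raw text
        start = sum(len(l) for l in lines[:a.sourceline - 1]) + a.sourcepos
        # Naive end-of-tag search - the part I'm unsure about; it breaks
        # if an attribute value contains ">"
        end = raw.index(">", start) + 1
        print(a.sourceline, repr(raw[start:end]))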

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Chris Angelico
On Sun, 21 Aug 2022 at 09:48, dn  wrote:
>
> On 20/08/2022 12.38, Chris Angelico wrote:
> > On Sat, 20 Aug 2022 at 10:19, dn  wrote:
> >> On 20/08/2022 09.01, Chris Angelico wrote:
> >>> On Sat, 20 Aug 2022 at 05:12, Barry  wrote:
> > On 19 Aug 2022, at 19:33, Chris Angelico  wrote:
> >
> > What's the best way to precisely reconstruct an HTML file after
> > parsing it with BeautifulSoup?
> ...
>
> >>> well. Thanks for trying, anyhow.
> >>>
> >>> So I'm left with a few options:
> >>>
> >>> 1) Give up on validation, give up on verification, and just run this
> >>> thing on the production site with my fingers crossed
> >>> 2) Instead of doing an intelligent reconstruction, just str.replace()
> >>> one URL with another within the file
> >>> 3) Split the file into lines, find the Nth line (elem.sourceline) and
> >>> str.replace that line only
> >>> 4) Attempt to use elem.sourceline and elem.sourcepos to find the start
> >>> of the tag, manually find the end, and replace one tag with the
> >>> reconstructed form.
> >>>
> >>> I'm inclined to the first option, honestly. The others just seem like
> >>> hard work, and I became a programmer so I could be lazy...
> >> +1 - but I've noticed that sometimes I have to work quite hard to be
> >> this lazy!
> >
> > Yeah, that's very true...
> >
> >> Am assuming that http -> https is not the only 'change' (if it were,
> >> you'd just do that without BS). How many such changes are planned/need
> >> checking? Care to list them?
>
> This project has many of the same 'smells' as a database-harmonisation
> effort. Particularly one where 'the previous guy' used to use field-X
> for certain data, but his replacement decided that field-Y 'sounded
> better' (or some such user-logic). Arrrggg!
>
> If you like head-aches, and users coming to you with ifs-buts-and-maybes
> AFTER you've 'done stuff', this is your sort of project!

Well, I don't like headaches, but I do appreciate what the G&S Archive
has given me over the years, so I'm taking this on as a means of
giving back to the community.

> > Assumption is correct. The changes are more of the form "find all the
> > problems, add to the list of fixes, try to minimize the ones that need
> > to be done manually". So far, what I have is:
>
> Having taken the trouble to identify this list of improvements and given
> the determination to verify each, consider working through one item at a
> time, rather than in a single pass. This will enable individual logging
> of changes, a manual check of each alteration, and the ability to
> choose/tailor the best tool for that specific task.
>
> In fact, depending upon frequency, consider making the changes manually (and
> with improved confidence in the result).

Unfortunately the frequency is very high.

> The presence of (or allusion to) the word "some" in these list-items is
> 'the killer'. Automation doesn't like 'some' (cf "all") unless the
> criteria can be clearly and unambiguously defined. Ouch!
>
> (I don't think you need to be told any of this, but hey: dreams are free!)

Right; the criteria are quite well defined, but I omitted the details
for brevity.

> > 1) A bunch of http -> https, but not all of them - only domains where
> > I've confirmed that it's valid
>
> The search-criteria is the list of valid domains, rather than the
> "http/https" which is likely the first focus.

Yeah. I do a first pass to enumerate all domains that are ever linked
to with http:// URLs, and then I have a script that goes through and
checks to see if they redirect me to the same URL on the other
protocol, or other ways of checking. So yes, the list of valid domains
is part of the program's effective input.
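
(The per-domain check is roughly this sketch - hypothetical helper, assuming
requests:)

    import requests

    def upgrades_cleanly(domain):
        # Does http://domain/ end up at the same URL under https?
        try:
            resp = requests.get("http://" + domain + "/",
                                allow_redirects=True, timeout=10)
        except requests.RequestException:
            return False
        return resp.url == "https://" + domain + "/"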

> > 2) Some absolute to relative conversions:
> > https://www.gsarchive.net/whowaswho/index.htm should be referred to as
> > /whowaswho/index.htm instead
>
> Similarly, if you have a list of these.

It's more just the pattern "https://www.gsarchive.net/" and
"https://gsarchive.net/", and the corresponding "http://";
URLs, plus a few other malformed versions that are worth correcting
(if ever I find a link to "www.gsarchive.net/", it's almost
certainly missing its protocol).
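
The rewrite for that pattern is simple enough (a sketch; INTERNAL_HOSTS and
relativise are names I'm inventing here):

    from urllib.parse import urlsplit, urlunsplit

    INTERNAL_HOSTS = {"www.gsarchive.net", "gsarchive.net"}

    def relativise(href):
        parts = urlsplit(href)
        if parts.hostname in INTERNAL_HOSTS:
            # Drop scheme and host; keep site-absolute path, query, fragment
            return urlunsplit(("", "", parts.path or "/",
                               parts.query, parts.fragment))
        return href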

> > 3) A few outdated URLs for which we know the replacement, eg
> > http://www.cris.com/~oakapple/gasdisc/ to
> > http://www.gasdisc.oakapplepress.com/ (this one can't go on
> > HTTPS, which is one reason I can't shortcut that)
>
> Again.

Same; although those are manually entered as patterns.

> > 4) Some internal broken links where the path is wrong - anything that
> > resolves to /books/ but can't be found might be better
> > rewritten as /html/perf_grps/websites/ if the file can be
> > found there
>
> Again.

The fixups are manually entered, but I also need to know about every
broken internal link so that I can look through them and figure out
what's wrong.
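
The lookup side of that is easy enough (a sketch; SITE_ROOT and the remap
table are placeholders):

    from pathlib import Path

    SITE_ROOT = Path("site")   # local checkout of the website
    REMAP = {"/books/": "/html/perf_grps/websites/"}

    def fix_internal(path):
        # Keep links that resolve; try the remapped directory for those
        # that don't. None means 'queue for manual review'.
        if (SITE_ROOT / path.lstrip("/")).exists():
            return path
        for old, new in REMAP.items():
            if path.startswith(old):
                candidate = new + path[len(old):]
                if (SITE_ROOT / candidate.lstrip("/")).exists():
                    return candidate
        return None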

> > 5) Any external link that yields a permanent redirect should, to save
> > clientside requests, get replaced by the destination. We have some
> > Creative Commons badges that have moved to new URLs.

Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread dn
On 21/08/2022 13.00, Chris Angelico wrote:
> On Sun, 21 Aug 2022 at 09:48, dn  wrote:
>> On 20/08/2022 12.38, Chris Angelico wrote:
>>> On Sat, 20 Aug 2022 at 10:19, dn  wrote:
 On 20/08/2022 09.01, Chris Angelico wrote:
> On Sat, 20 Aug 2022 at 05:12, Barry  wrote:
>>> On 19 Aug 2022, at 19:33, Chris Angelico  wrote:

> So I'm left with a few options:
>
> 1) Give up on validation, give up on verification, and just run this
> thing on the production site with my fingers crossed
> 2) Instead of doing an intelligent reconstruction, just str.replace()
> one URL with another within the file
> 3) Split the file into lines, find the Nth line (elem.sourceline) and
> str.replace that line only
> 4) Attempt to use elem.sourceline and elem.sourcepos to find the start
> of the tag, manually find the end, and replace one tag with the
> reconstructed form.
>
> I'm inclined to the first option, honestly. The others just seem like
> hard work, and I became a programmer so I could be lazy...
 +1 - but I've noticed that sometimes I have to work quite hard to be
 this lazy!
>>>
>>> Yeah, that's very true...
>>>
 Am assuming that http -> https is not the only 'change' (if it were,
 you'd just do that without BS). How many such changes are planned/need
 checking? Care to list them?
>>
>> This project has many of the same 'smells' as a database-harmonisation
>> effort. Particularly one where 'the previous guy' used to use field-X
>> for certain data, but his replacement decided that field-Y 'sounded
>> better' (or some such user-logic). Arrrggg!
>>
>> If you like head-aches, and users coming to you with ifs-buts-and-maybes
>> AFTER you've 'done stuff', this is your sort of project!
> 
> Well, I don't like headaches, but I do appreciate what the G&S Archive
> has given me over the years, so I'm taking this on as a means of
> giving back to the community.

This point will be picked-up in the conclusion. NB in the same way that
you want to 'give back', so also do others - even if in minor ways or
'when-relevant'!


>>> Assumption is correct. The changes are more of the form "find all the
>>> problems, add to the list of fixes, try to minimize the ones that need
>>> to be done manually". So far, what I have is:
>>
>> Having taken the trouble to identify this list of improvements and given
>> the determination to verify each, consider working through one item at a
>> time, rather than in a single pass. This will enable individual logging
>> of changes, a manual check of each alteration, and the ability to
>> choose/tailor the best tool for that specific task.
>>
>> In fact, depending upon frequency, consider making the changes manually (and
>> with improved confidence in the result).
> 
> Unfortunately the frequency is very high.

Screechingly so? Like you're singing Three Little Maids?


>> The presence of (or allusion to) the word "some" in these list-items is
>> 'the killer'. Automation doesn't like 'some' (cf "all") unless the
>> criteria can be clearly and unambiguously defined. Ouch!
>>
>> (I don't think you need to be told any of this, but hey: dreams are free!)
> 
> Right; the criteria are quite well defined, but I omitted the details
> for brevity.
> 
>>> 1) A bunch of http -> https, but not all of them - only domains where
>>> I've confirmed that it's valid
>>
>> The search-criteria is the list of valid domains, rather than the
>> "http/https" which is likely the first focus.
> 
> Yeah. I do a first pass to enumerate all domains that are ever linked
> to with http:// URLs, and then I have a script that goes through and
> checks to see if they redirect me to the same URL on the other
> protocol, or other ways of checking. So yes, the list of valid domains
> is part of the program's effective input.

Wow! Having got that far, you have achieved data-validity. Is there a
need to perform a before-after check or diff?

Perhaps start making the one-for-one replacements without further
anxiety. As long as there's no silly-mistake, eg failing to remove an
opening or closing angle-bracket; isn't that about all the checking needed?
(for this category of updates)


>>> 2) Some absolute to relative conversions:
>>> https://www.gsarchive.net/whowaswho/index.htm should be referred to as
>>> /whowaswho/index.htm instead
>>
>> Similarly, if you have a list of these.
> 
> It's more just the pattern "https://www.gsarchive.net/" and
> "https://gsarchive.net/", and the corresponding "http://";
> URLs, plus a few other malformed versions that are worth correcting
> (if ever I find a link to "www.gsarchive.net/", it's almost
> certainly missing its protocol).

Isn't the inspection tool (described elsewhere) reporting an HTML/editor
line number?

That being the case, won't a bit of Swiss-Army knife Python-string work
enable appropriate processing and re-writing - as well as providing the
means to statistically-sample for QA?


>>> 3) A few outdated URLs for which we know the replacement, eg

Re: Mutating an HTML file with BeautifulSoup

2022-08-20 Thread Chris Angelico
On Sun, 21 Aug 2022 at 13:41, dn  wrote:
>
> On 21/08/2022 13.00, Chris Angelico wrote:
> > Well, I don't like headaches, but I do appreciate what the G&S Archive
> > has given me over the years, so I'm taking this on as a means of
> > giving back to the community.
>
> This point will be picked-up in the conclusion. NB in the same way that
> you want to 'give back', so also do others - even if in minor ways or
> 'when-relevant'!

Very true.

> >> In fact, depending upon frequency, making the changes manually (and with
> >> improved confidence in the result).
> >
> > Unfortunately the frequency is very high.
>
> Screechingly so? Like you're singing Three Little Maids?

You don't want to hear me singing that, although I do recall once
singing Lady Ella's part at a Qwert, to gales of laughter.

> > Yeah. I do a first pass to enumerate all domains that are ever linked
> > to with http:// URLs, and then I have a script that goes through and
> > checks to see if they redirect me to the same URL on the other
> > protocol, or other ways of checking. So yes, the list of valid domains
> > is part of the program's effective input.
>
> Wow! Having got that far, you have achieved data-validity. Is there a
> need to perform a before-after check or diff?

Yes, to ensure that nothing has changed that I *didn't* plan. The
planned changes aren't the problem here; I can verify those elsewhere.

> Perhaps start making the one-for-one replacements without further
> anxiety. As long as there's no silly-mistake, eg failing to remove an
> opening or closing angle-bracket; isn't that about all the checking needed?
> (for this category of updates)

Maybe, but probably not.

> BTW in talk of "line-number", you will have realised the need to re-run
> the identification of such after each of these steps - in case the 'new
> stuff' relating to earlier steps (assuming above became also a temporal
> sequence) is shorter/longer than the current HTML.

Yep, that's not usually a problem.

> >>> And there'll be other fixes to be done too. So it's a bit complicated,
> >>> and no simple solution is really sufficient. At the very very least, I
> >>> *need* to properly parse with BS4; the only question is whether I
> >>> reconstruct from the parse tree, or go back to the raw file and try to
> >>> edit it there.
> >>
> >> At least the diffs would give you something to work-from, but it's a bit
> >> like git-diffs claiming a 'change' when the only difference is that my
> >> IDE strips blanks from the ends of code-lines, or some-such silliness.
> >
> > Right; and the reconstructed version has a LOT of those unnecessary
> > changes. I'm seeing a lot of changes to whitespace. The only problem
> > is whether I can be confident that none of those changes could ever
> > matter.
>
> "White-space" has lesser-meaning in HTML - this is NOT Python! In HTML
> if I write "HTML  file" (with two spaces), the browser will shorten the
> display to a single space (hence some uses of   - non-broken
> space). Similarly, if attempt to use "\n" to start a new line of text...

Yes, whitespace has less meaning... except when it doesn't.

https://developer.mozilla.org/en-US/docs/Web/CSS/white-space

Text can become preformatted by the styling, and there could be
nothing whatsoever in the HTML page that shows this. I think most of
the HTML files in this site have been created by a WYSIWYG editor,
partly because of clues like a single bold space in a non-bold
sequence of text, and the styles aren't consistent everywhere. Given
that poetry comes up a lot on this site, I wouldn't put it past the
editor to have set a whitespace rule on something.

But I'm probably going to just ignore that and hope that any such
errors are less significant than the current set of broken links.

> Is there a danger of 'chasing your own tail', ie seeking a solution to a
> problem which really doesn't matter (particularly if we add the phrase:
> at the user-level)?

Unfortunately not. I now know of three categories of change that, in
theory, shouldn't affect anything: whitespace, order of attributes
("" becoming ""), and
self-closing tags. Whitespace probably won't matter, until it does.
Order of attributes is absolutely fine unless one of them is
miswritten and now we've lost a lot of information about how it ought
to have been written. And self-closing tags are probably
insignificant, but I don't know how browsers handle things like
"..." - and I wouldn't know whether the original
intention was for the second one to be a self-closing empty paragraph,
or a miswritten closing tag.

It's easy to say that these changes have no effect on well-formed
HTML. It's less easy to know what browsers will do with ill-formed
HTML.

> Agree with "properly parse". Question was an apparent dedication to BS4
> when there are other tools. Just checking you aren't wearing that type
> of 'blinders'.
> (didn't think so, but...)

No, but there's also always the option of some tool that I've never
heard of! The