Re: Screen scraper to get all 'a title' elements
Hi
It seems that links on that Wikipedia page follow the structure :
<a href="..." title="...">
You could extract a list of link titles with something like :
re.findall( r'\<a[^>]+title="(.+?)"', html )
HTH,
-Grobu-
On 25/11/15 21:55, MRAB wrote:
On 2015-11-25 20:42, ryguy7272 wrote:
Hello experts. I'm looking at this url:
https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names
I'm trying to figure out how to list all 'a title' elements. For
instance, I see the following:
Accident
Ala-Lemu
Alert
Apocalypse
Peaks
So, I tried putting a script together to get 'title'. Here's my attempt.
import requests
import sys
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names"
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text)
for link in soup.findAll('title'):
    print(link)
All that does is get the title of the page. I tried to get the links
from that url, with this script.
A 'title' element has the form "<title ...>". What you should be looking
for are 'a' elements, those of the form "<a ...>".
import urllib2
import re
#connect to a URL
website = urllib2.urlopen('https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names')
#read html code
html = website.read()
#use re.findall to get all the links
links = re.findall('"((http|ftp)s?://.*?)"', html)
print links
That doesn't work either. Basically, I'd like to see this.
Accident
Ala-Lemu
Alert
Apocalypse Peaks
Athol
Å
Barbecue
Båstad
Bastardstown
Batman
Bathmen (Battem), Netherlands
...
Worms
Yell
Zigzag
Zzyzx
How can I do that?
Thanks all!!
--
https://mail.python.org/mailman/listinfo/python-list
Re: Screen scraper to get all 'a title' elements
On 25/11/15 23:48, ryguy7272 wrote:
re.findall( r'\<a[^>]+title="(.+?)"', html )
[ ... ]
Thanks!! Is that regex? Can you explain exactly what it is doing?
Also, it seems to pick up a lot more than just the list I wanted, but that's
ok, I can see why it does that.
Can you just please explain what it's doing???
Yes, it's a regular expression. Because regexes use the backslash as an
escape character, it is advisable to use the "raw string" prefix (an r
before the opening single/double/triple quote). To illustrate it with an
example :
>>> print "1\n2"
1
2
>>> print r"1\n2"
1\n2
As the backslash escape character is "neutralized" by the raw string,
you can use the usual RegEx syntax at leisure :
\<a[^>]+title="(.+?)"
\< was a mistake on my part, a single < is perfectly enough
[^>] is a class definition, and the caret (^) character indicates
negation. Thus it means : any character other than >
+ indicates repetition : one or more of the previous element
. will match any character (except a newline, by default)
.+" is a _greedy_ pattern that matches as much as it can, up to the last
double quote it can find
The problem with a greedy pattern is that it doesn't stop at the first
match. To illustrate :
>>> a = re.search( r'".+"', 'title="this is a test" class="test"' )
>>> a.group()
'"this is a test" class="test"'
It matches the first quote up to the last one.
On the other hand, you can use the "?" modifier to specify a non-greedy
pattern :
>>> b = re.search( r'".+?"', 'title="this is a test" class="test"' )
>>> b.group()
'"this is a test"'
It matches the first quote and stops looking for further matches after
the second quote.
Finally, the parentheses are used to indicate a capture group :
>>> a = re.search( r'"this (is) a (.+?)"', 'title="this is a test" class="test"' )
>>> a.groups()
('is', 'test')
You can find detailed explanations about Python regular expressions at
this page : https://docs.python.org/2/howto/regex.html
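Putting the pieces together, here is a minimal sketch of the pattern applied
to a tiny hand-written sample, written for Python 3. The sample markup is
made up for illustration and is not the actual Wikipedia source; the
unnecessary backslash before < has been dropped.

```python
import re

# Hypothetical sample markup, made up for illustration.
html = (
    '<li><a href="/wiki/Accident,_Maryland" title="Accident, Maryland">Accident</a></li>\n'
    '<li><a href="/wiki/Zzyzx,_California" title="Zzyzx, California">Zzyzx</a></li>\n'
)

# <a        literal start of an anchor tag
# [^>]+     one or more characters that are not '>', so we stay inside the tag
# title="   literal attribute name and opening quote
# (.+?)     capture group, non-greedy: stops at the first closing quote
# "         closing quote
pattern = re.compile(r'<a[^>]+title="(.+?)"')

titles = pattern.findall(html)
print(titles)
```

Because the pattern contains exactly one capture group, findall() returns
only the captured titles rather than the whole matched tag.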
HTH,
-Grobu-
Re: Screen scraper to get all 'a title' elements
On 26/11/15 00:06, Chris Angelico wrote:
On Thu, Nov 26, 2015 at 9:48 AM, ryguy7272 wrote:
Thanks!! Is that regex? Can you explain exactly what it is doing?
Also, it seems to pick up a lot more than just the list I wanted, but that's
ok, I can see why it does that.
Can you just please explain what it's doing???
It's a trap! Don't use a regex to parse HTML, unless you're deliberately
trying to entice young and innocent programmers to the dark side.
ChrisA
Sorry, I wasn't aware of regex being on the dark side :-)
Now that you mention it, I suppose that their being complex and
error-inducing could lead to broken code all too easily when there is a
reliable, ready-made solution like BeautifulSoup.
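For comparison, a parser-based approach might look roughly like the sketch
below. It uses the standard library's html.parser as a stand-in for
BeautifulSoup, so that it runs without extra packages; the TitleCollector
class and the sample markup are hypothetical names made up for illustration.

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the title attribute of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        if tag == "a":
            for name, value in attrs:
                if name == "title" and value is not None:
                    self.titles.append(value)


# Made-up sample markup, not the real Wikipedia page.
sample = (
    '<ul><li><a href="/wiki/Batman" title="Batman, Turkey">Batman</a></li>'
    '<li><a href="/wiki/Yell" title="Yell, Shetland">Yell</a></li></ul>'
)

collector = TitleCollector()
collector.feed(sample)
print(collector.titles)
```

With BeautifulSoup installed, something along the lines of
[a["title"] for a in soup.find_all("a", title=True)] should give the same
kind of list.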
Re: Screen scraper to get all 'a title' elements
Chris, Marko, thank you both for your links and explanations!
Re: Find relative url in mixed text/html
On 28/11/15 03:35, Rob Hills wrote:
Hi,
For my sins I am migrating a volunteer association forum from one
platform (WebWiz) to another (phpBB). I am (I hope) 95% of the way
through the process.
Posts to our original forum comprise a soup of plain text, HTML and
BBCodes. A post */may/* include links done as either standard HTML links
( http://blah.blah.com.au or even just www.blah.blah.com.au ).
In my conversion process, I am trying to identify cross-links (links
from one post on the forum to another) so I can convert them to links
that will work in the new forum.
My current code uses a Regular Expression (yes, I read the recent posts
on this forum about regex and HTML!) to pull out "absolute" links
( starting with http:// ) and then I use Python to identify and convert
the specific links I am interested in. However, the forum also contains
"cross-links" done using relative links and I'm unsure how best to
proceed with that one. Googling so far has not been helpful, but that
might be me using the wrong search terms.
Some examples of what I am talking about are:
Post fragment containing an "Absolute" cross-link:
ive made a new thread:
http://www.aeva.asn.au/forums/forum_posts.asp?TID=316&PID=1958#1958
converts to:
ive made a new thread: /viewtopic.php?t=316&p=1958#1958
Post fragment containing a "Relative" cross-link:
Battery Management SystemVeroboard prototype
Needs converting to:
Battery Management SystemVeroboard prototype
So, my question is: What is the best way to extract a list of "relative
links" from mixed text/html that I can then walk through to identify the
specific ones I want to convert?
Note, in the beginning of this project, I looked at using "Beautiful
Soup" but my reading and limited testing lead me to believe that it is
designed for well-formed HTML/XML and therefore was unsuitable for the
text/html soup I have. If that belief is incorrect, I'd be grateful for
general tips about using Beautiful Soup in this scenario...
TIA,
Hi Rob
Is it safe to assume that all the relative (cross) links take one of the
following forms? :
http://www.aeva.asn.au/forums/forum_posts.asp
www.aeva.asn.au/forums/forum_posts.asp
/forums/forum_posts.asp
/forum_posts.asp (are you really sure about this one?)
If so, and if your goal boils down to converting all instances of old
style URLs to new style ones regardless of the context where they
appear, why would a regex fail to meet your needs?
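As a rough illustration of that idea, the sketch below rewrites every old
style link in one pass, whatever its prefix. It assumes that relative links
carry the same TID/PID query parameters as the absolute example in the
question (the relative example itself did not survive in this archive), and
convert_links is a hypothetical helper name.

```python
import re

# Optional scheme, optional host, optional leading slash and /forums/ prefix,
# then the old script name with its TID (topic) and PID (post) parameters.
OLD_LINK = re.compile(
    r'(?:https?://)?(?:www\.aeva\.asn\.au)?/?(?:forums/)?'
    r'forum_posts\.asp\?TID=(\d+)&PID=(\d+)(#\d+)?',
    re.IGNORECASE,
)


def convert_links(text):
    """Rewrite old-style post links to the new /viewtopic.php form."""
    def repl(match):
        topic, post = match.group(1), match.group(2)
        anchor = match.group(3) or ''
        return '/viewtopic.php?t=%s&p=%s%s' % (topic, post, anchor)
    return OLD_LINK.sub(repl, text)


old = ('ive made a new thread: '
       'http://www.aeva.asn.au/forums/forum_posts.asp?TID=316&PID=1958#1958')
print(convert_links(old))
print(convert_links('see forum_posts.asp?TID=12&PID=34'))
```

Links that lack a PID, or that encode & as an HTML entity, would need the
pattern to be extended.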
Re: static variables
Perhaps you could use a parameter's default value to implement your
static variable? Like :
# -
>>> def test(arg=[0]):
...     print arg[0]
...     arg[0] += 1
...
>>> test()
0
>>> test()
1
# -
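The same trick in Python 3 syntax, as a small sketch (counter and _state are
made-up names). It relies on the default list being created once, when the
function is defined, rather than on every call.

```python
def counter(_state=[0]):
    """Return how many times this function has been called before.

    The default list is created once, at definition time, so it
    persists between calls and acts like a static variable.
    """
    current = _state[0]
    _state[0] += 1
    return current


first = counter()
second = counter()
third = counter()
print(first, second, third)
```

Any caller who passes their own list as _state bypasses the shared counter,
which is the main drawback of this approach.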
Re: filter a list of strings
On 03/12/15 02:15, [email protected] wrote:
I would like to know how this could be done more elegant/pythonic.
I have a big list (over 10.000 items) with strings (each 100 to 300
chars long) and want to filter them.
list = .
for item in list[:]:
    if 'Banana' in item:
        list.remove(item)
    if 'Car' in item:
        list.remove(item)
There are a lot of more conditions of course. This is just example code.
It doesn't look nice to me. To much redundance.
btw: Is it correct to iterate over a copy (list[:]) of that string list
and not the original one?
No idea how 'Pythonic' this would be considered, but you could use a
combination of filter() with a regular expression :
# --
import re
list = ...
pattern = re.compile( r'banana|car', re.I )
filtered_list = filter( lambda line: not pattern.search(line), list )
# --
HTH
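In Python 3, filter() returns an iterator rather than a list, so a list
comprehension is probably the simpler spelling there. A minimal sketch, with
made-up sample data standing in for the real list:

```python
import re

# Hypothetical sample data standing in for the real list of strings.
lines = ['Banana bread', 'A red car', 'Apple pie', 'CARAVAN park', 'Cherry tart']

# One compiled pattern covers every unwanted word; re.I makes the match
# case-insensitive, as in the filter() version above.
pattern = re.compile(r'banana|car', re.I)

# The comprehension builds a new list, so nothing is removed from a
# list while it is being iterated over.
filtered = [line for line in lines if not pattern.search(line)]
print(filtered)
```

Note that the pattern matches substrings, so 'CARAVAN park' is dropped too;
word boundaries (\b) could be added if only whole words should count.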
Re: How to union nested Sets / A single set from nested sets?
On 04/01/16 03:40, mviljamaa wrote:
I'm forming sets by set.adding to sets and this leads to sets such as:
Set([ImmutableSet(['a', ImmutableSet(['a'])]), ImmutableSet(['b', 'c'])])
Is there way union these to a single set, i.e. get
Set(['a', 'b', 'c'])
?
There's a built-in "union" method for sets :
>>> a = set( ['a', 'b'] )
>>> b = set( ['c', 'd'] )
>>> a.union(b)
set(['a', 'c', 'b', 'd'])
HTH
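union() on its own merges only one level, so for sets nested to arbitrary
depth a small recursive helper may be needed. The sketch below is one way to
do it; flatten is a made-up name, and frozenset stands in for the old
ImmutableSet from the question.

```python
def flatten(nested):
    """Return a flat set of all non-set members found in nested sets."""
    flat = set()
    for member in nested:
        if isinstance(member, (set, frozenset)):
            # Recurse into inner sets and merge their members.
            flat |= flatten(member)
        else:
            flat.add(member)
    return flat


# frozenset plays the role of ImmutableSet here.
nested = {frozenset(['a', frozenset(['a'])]), frozenset(['b', 'c'])}
result = flatten(nested)
print(sorted(result))
```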
Re: Single format descriptor for list
On 20/01/16 10:35, Paul Appleby wrote:
In BASH, I can have a single format descriptor for a list:
$ a='4 5 6 7'
$ printf "%sth\n" $a
4th
5th
6th
7th
Is this not possible in Python? Using "join" rather than "format" still
doesn't quite do the job:
a = range(4, 8)
print ('th\n'.join(map(str,a)))
4th
5th
6th
7
Is there an elegant way to print-format an arbitrary length list?
In Python 2.7 :
#
a = '4 5 6 7'
for item in a.split():
    print '%sth' % item
#
or
#
a = '4 5 6 7'.split()
print ('{}th\n' * len(a)).format(*a),
#
or
#
a = '4 5 6 7'
print ''.join( map( '{}th\n'.format, a.split() ) ),
#
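For Python 3, where print is a function, a roughly equivalent sketch of the
last variant would be:

```python
a = range(4, 8)

# Apply the same format to every item, then join the pieces.
text = ''.join('{}th\n'.format(item) for item in a)

# Each piece already ends with a newline, so suppress print's own.
print(text, end='')
```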
Re: How to simulate C style integer division?
On 21/01/16 09:39, Shiyao Ma wrote:
Hi,
I wanna simulate C style integer division in Python3.
So far what I've got is:
# a, b = 3, 4
import math
result = float(a) / b
if result > 0:
    result = math.floor(result)
else:
    result = math.ceil(result)
I found it's too laborious. Any quick way?
math.trunc( float(a) / b )
Re: How to simulate C style integer division?
On 22/01/16 04:48, Steven D'Aprano wrote:
[ ... ]
math.trunc( float(a) / b )
That fails for sufficiently big numbers:
py> a = 3**1000 * 2
py> b = 3**1000
py> float(a)/b  # Exact answer should be 2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: long int too large to convert to float
Note that Python gets the integer division correct:
py> a//b
2L
And even gets true division correct:
py> from __future__ import division
py> a/b
2.0
so it's just the intermediate conversion to float that fails.
Thanks! I did see recommendations to avoid floats throughout the thread,
but didn't understand why.
Following code should be exempt from such shortcomings :
def intdiv(a, b):
    return (a - (a % (-b if a < 0 else b))) / b
Re: How to simulate C style integer division?
def intdiv(a, b):
    return (a - (a % (-b if a < 0 else b))) / b
Duh ... Got confused with modulos (again).
def intdiv(a, b):
    return (a - (a % (-abs(b) if a < 0 else abs(b)))) / b
Re: How to simulate C style integer division?
On 23/01/16 16:07, Jussi Piitulainen wrote:
Grobu writes:
def intdiv(a, b):
    return (a - (a % (-b if a < 0 else b))) / b
Duh ... Got confused with modulos (again).
def intdiv(a, b):
    return (a - (a % (-abs(b) if a < 0 else abs(b)))) / b
You should use // here to get an exact integer result.
You're totally right, thanks! It isn't an issue for Python 2.x's
"classic division", but it becomes one with Python 3's "true division",
and '//' lets you play it safe in both.
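With that correction folded in, the function could be written as in the
sketch below (the intermediate name modulus is added here for readability).
It stays in integer arithmetic throughout, so it also copes with the large
operands from Steven's example.

```python
def intdiv(a, b):
    """Integer division that truncates toward zero, as C does."""
    # Pick a modulus whose sign matches a, so that a % modulus has the
    # sign of a (or is zero) and subtracting it moves a toward zero.
    modulus = -abs(b) if a < 0 else abs(b)
    return (a - a % modulus) // b


print(intdiv(7, 2), intdiv(-7, 2), intdiv(7, -2), intdiv(-7, -2))
big = 3 ** 1000
print(intdiv(2 * big, big))
```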
Re: python-2.7.3 vs python-3.2.3
On 26/01/16 13:26, Gene Heskett wrote:
Greetings;
I have need of using a script written for python3, but the default python
on wheezy is 2.7.3. I see in the wheezy repos that 3.2.3-6 is available.
Can/will they co-exist peacefully?
Thank you.
Cheers, Gene Heskett
On Debian Jessie :
$ ls -F /usr/bin/pyth*
/usr/bin/python@     /usr/bin/python3@     /usr/bin/python3m@
/usr/bin/python2@    /usr/bin/python3.4*
/usr/bin/python2.7*  /usr/bin/python3.4m*
... so it seems they can co-exist peacefully.
Re: show instant data on webpage
On 26/01/16 17:10, mustang wrote:
I've built a sensor to measure some values.
I would like to show it on a web page with python.
This is an extract of the code:
file = open("myData.dat", "w")
while True:
    temp = sensor.readTempC()
    riga = "%f\n" % temp
    file.write(riga)
    time.sleep(1.0)
file.close()
Until this all ok.
Then in PHP I read the file and show it on internet. It works ok but...
First problem. I have to stop the streaming with Ctrl-C to load the PHP
file because if I try to read during measurement I cannot show anything
(I think because the file is in use).
How can I show the output in a web page as it is produced?
Perhaps you could use a Javascript plotting library (like the one at
flotcharts.org), and use Python only as a CGI agent to serve plotting
data when requested by Javascript, at regular intervals?
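A minimal sketch of the Python side of that idea follows. It assumes the
empty reads come from buffered writes, so the logger flushes after each
line; readings_as_json is a hypothetical helper whose output a CGI script or
any other web handler could send back, and the [index, value] pair format is
an assumption about what the plotting code wants.

```python
import json
import os
import tempfile


def readings_as_json(path, last_n=60):
    """Return the last_n readings in path as a JSON list of [index, value]."""
    with open(path) as handle:
        values = [float(line) for line in handle if line.strip()]
    start = max(0, len(values) - last_n)
    points = [[i, values[i]] for i in range(start, len(values))]
    return json.dumps(points)


# Simulate the logger: flush after each write so that a reader sees the
# data while the logging loop is still running.
path = os.path.join(tempfile.mkdtemp(), 'myData.dat')
with open(path, 'w') as log:
    for temp in (21.5, 21.75, 22.0):
        log.write('%f\n' % temp)
        log.flush()
    # Read while the log file is still open for writing.
    payload = readings_as_json(path, last_n=2)

print(payload)
```

A real reader would also have to tolerate a half-written last line, which
this sketch does not handle.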
Re: [STORY-TIME] THE BDFL AND HIS PYTHON PETTING ZOO
On 03/02/16 04:26, Rick Johnson wrote:
[ ... ]
And many children came from the far and wide, and they would pet his
snake, and they would play with his snake
Didn't know Pedobear had a biographer.
Re: Set Operations on Dicts
You can use dictionary comprehension :
Say :
dict1 = {'a': 123, 'b': 456}
set1 = {'a'}
intersection :
>>> { key:dict1[key] for key in dict1 if key in set1 }
{'a': 123}
difference :
>>> { key:dict1[key] for key in dict1 if key not in set1 }
{'b': 456}
Re: Set Operations on Dicts
On 08/02/16 17:12, Ian Kelly wrote:
dict does already expose set-like views. How about:
{k: d[k] for k in d.keys() & s} # d & s
{k: d[k] for k in d.keys() - s} # d - s
Interesting. But seemingly only applies to Python 3.
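For reference, a runnable Python 3 sketch of Ian's suggestion, using the
sample data from the earlier message (the names d and s are his). In
Python 2.7, d.viewkeys() should provide the same set-like view.

```python
d = {'a': 123, 'b': 456}
s = {'a'}

# In Python 3, d.keys() is a set-like view, so & and - work directly.
intersection = {k: d[k] for k in d.keys() & s}
difference = {k: d[k] for k in d.keys() - s}
print(intersection, difference)
```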
Re: What is the right way to import a package?
On 14/11/15 21:00, fl wrote:
Hi,
I want to use a code snippet found on-line. It has such content:
from numpy import *
dt = 0.1
# Initialization of state matrices
X = array([[0.0], [0.0], [0.1], [0.1]])
# Measurement matrices
Y = array([[X[0,0] + abs(randn(1)[0])], [X[1,0] + abs(randn(1)[0])]])
When the above content is inside a .py document and running, there will be
an error:
---> 15 Y = array([[X[0,0] + abs(randn(1)[0])], [X[1,0] + abs(randn(1)[0])]])
     16 #Y = ([[X[0,0]], [X[1,0] + 0]])
NameError: name 'randn' is not defined
But when I run the above line by line at the console (Canopy), there will be
no error for the above line.
My question is:
The import and the following are wrong.
X = array([[0.0], [0.0], [0.1], [0.1]])
It should be:
import numpy as np
...
Y = np.array([[X[0,0] + abs(np.randn(1)[0])], [X[1,0] + abs(np.randn(1)[0])]])
This looks like the code I once saw. But the file when running has such
error:
---> 15 Y = np.array([[X[0,0] + abs(np.randn(1)[0])], [X[1,0] + abs(np.randn(1)[0])]])
AttributeError: 'module' object has no attribute 'randn'
When it is run line by line at the console, it has the same error.
It is strange that the same content has errors depends on inside a file, or
at CLI console.
What is missing I don't realize? Thanks,
You can try :
from numpy import *
from numpy.random import *
HTH,
- Grobu -
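If you would rather avoid star imports, a sketch of the same lines with
explicit namespaces could look like this. It assumes NumPy is installed;
randn is reached through the numpy.random submodule.

```python
import numpy as np

# Initialization of state matrix, as in the question.
X = np.array([[0.0], [0.0], [0.1], [0.1]])

# randn lives in the numpy.random submodule, not at numpy's top level,
# which is why np.randn raises AttributeError.
Y = np.array([[X[0, 0] + abs(np.random.randn(1)[0])],
              [X[1, 0] + abs(np.random.randn(1)[0])]])
print(Y.shape)
```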
