Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread Grobu

Hi

It seems that links on that Wikipedia page follow the structure :

<a ... title="...">

You could extract a list of link titles with something like :
re.findall( r'\<a[^>]+title="(.+?)"', html )
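
For illustration, here is a minimal sketch of that idea in Python 3 syntax. The sample markup is made up (it is not copied from the actual page), and the pattern assumes each link keeps its title attribute inside a single <a ...> tag:

```python
import re

# Hypothetical sample markup, not copied from the actual page.
html = (
    '<li><a href="/wiki/Accident,_Maryland" title="Accident, Maryland">Accident</a></li>'
    '<li><a href="/wiki/Zzyzx" title="Zzyzx">Zzyzx</a></li>'
)

# Capture the value of the title attribute of every <a ...> tag.
titles = re.findall(r'<a[^>]+title="(.+?)"', html)
print(titles)
```

On a real page this will likely pick up navigation links too, so some filtering afterwards is probably needed.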

HTH,

-Grobu-


On 25/11/15 21:55, MRAB wrote:

On 2015-11-25 20:42, ryguy7272 wrote:

Hello experts.  I'm looking at this url:
https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names

I'm trying to figure out how to list all 'a title' elements.  For
instance, I see the following:
Accident
Ala-Lemu
Alert
Apocalypse
Peaks

So, I tried putting a script together to get 'title'.  Here's my attempt.

import requests
import sys
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names";
source_code = requests.get(url)
plain_text = source_code.text
soup = BeautifulSoup(plain_text)
for link in soup.findAll('title'):
 print(link)

All that does is get the title of the page.  I tried to get the links
from that url, with this script.


A 'title' element has the form "<title ...>". What you should be looking
for are 'a' elements, those of the form "<a ...>".


import urllib2
import re

#connect to a URL
website = urllib2.urlopen('https://en.wikipedia.org/wiki/Wikipedia:Unusual_place_names')


#read html code
html = website.read()

#use re.findall to get all the links
links = re.findall('"((http|ftp)s?://.*?)"', html)

print links

That doesn't work either.  Basically, I'd like to see this.

Accident
Ala-Lemu
Alert
Apocalypse Peaks
Athol
Å
Barbecue
Båstad
Bastardstown
Batman
Bathmen (Battem), Netherlands
...
Worms
Yell
Zigzag
Zzyzx

How can I do that?
Thanks all!!


--
https://mail.python.org/mailman/listinfo/python-list


Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread Grobu


On 25/11/15 23:48, ryguy7272 wrote:

re.findall( r'\<a[^>]+title="(.+?)"', html )

[ ... ]

Thanks!!  Is that regex?  Can you explain exactly what it is doing?
Also, it seems to pick up a lot more than just the list I wanted, but that's 
ok, I can see why it does that.

Can you just please explain what it's doing???



Yes, it's a regular expression. Because regexes use the backslash as an 
escape character, it is advisable to use the "raw string" prefix (r 
before the opening single/double/triple quote). To illustrate with an example :

>>> print "1\n2"
1
2
>>> print r"1\n2"
1\n2
As the backslash escape character is "neutralized" by the raw string, 
you can use the usual RegEx syntax at leisure :


\<a[^>]+title="(.+?)"

\<    was a mistake on my part, a single < is perfectly enough
[^>]  is a class definition, and the caret (^) character indicates 
      negation. Thus it means : any character other than >
+     indicates repetition : one or more of the previous element
.     will match any character (except a newline, by default)
.+"   is a _greedy_ pattern that would match anything up to and 
      including a double quote


The problem with a greedy pattern is that it doesn't stop at the first 
match. To illustrate :

>>> a = re.search( r'".+"', 'title="this is a test" class="test"' )
>>> a.group()
'"this is a test" class="test"'

It matches the first quote up to the last one.
On the other hand, you can use the "?" modifier to specify a non-greedy 
pattern :

>>> b = re.search( r'".+?"', 'title="this is a test" class="test"' )
>>> b.group()
'"this is a test"'

It matches the first quote and stops looking for further matches after 
the second quote.


Finally, the parentheses are used to indicate a capture group :
>>> a = re.search( r'"this (is) a (.+?)"', 'title="this is a test" class="test"' )

>>> a.groups()
('is', 'test')


You can find detailed explanations about Python regular expressions at 
this page : https://docs.python.org/2/howto/regex.html


HTH,

-Grobu-



Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread Grobu

On 26/11/15 00:06, Chris Angelico wrote:

On Thu, Nov 26, 2015 at 9:48 AM, ryguy7272  wrote:

Thanks!!  Is that regex?  Can you explain exactly what it is doing?
Also, it seems to pick up a lot more than just the list I wanted, but that's 
ok, I can see why it does that.

Can you just please explain what it's doing???


It's a trap!

Don't use a regex to parse HTML, unless you're deliberately trying to
entice young and innocent programmers to the dark side.

ChrisA



Sorry, I wasn't aware of regex being on the dark side :-)
Now that you mention it, I suppose that their being complex and 
error-inducing could lead to broken code all too easily when there is a 
reliable, ready-made solution like BeautifulSoup.
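
As a rough sketch of the parser-based approach, here is a version that uses the standard library's html.parser module instead of BeautifulSoup (a swapped-in technique, chosen only so the example has no third-party dependency). It is written for Python 3; the class name and the sample markup are made up for illustration:

```python
from html.parser import HTMLParser


class LinkTitleCollector(HTMLParser):
    """Collect the title attribute of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == 'a':
            title = dict(attrs).get('title')
            if title is not None:
                self.titles.append(title)


# Hypothetical sample markup: one link with a title, one without.
sample = ('<p><a href="/wiki/Batman" title="Batman">Batman</a> and '
          '<a href="#top">no title here</a></p>')
collector = LinkTitleCollector()
collector.feed(sample)
print(collector.titles)
```

Unlike the regex, this does not care about attribute order or about line breaks inside a tag.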




Re: Screen scraper to get all 'a title' elements

2015-11-25 Thread Grobu

Chris, Marko, thank you both for your links and explanations!


Re: Find relative url in mixed text/html

2015-11-27 Thread Grobu

On 28/11/15 03:35, Rob Hills wrote:

Hi,

For my sins I am migrating a volunteer association forum from one
platform (WebWiz) to another (phpBB).  I am (I hope) 95% of the way
through the process.

Posts to our original forum comprise a soup of plain text, HTML and
BBCodes.  A post */may/* include links done as either standard HTML
links ( http://blah.blah.com.au or even just
www.blah.blah.com.au ).

In my conversion process, I am trying to identify cross-links (links
from one post on the forum to another) so I can convert them to links
that will work in the new forum.

My current code uses a Regular Expression (yes, I read the recent posts
on this forum about regex and HTML!) to pull out "absolute" links (
starting with http:// ) and then I use Python to identify and convert
the specific links I am interested in.  However, the forum also contains
"cross-links" done using relative links and I'm unsure how best to
proceed with that one.  Googling so far has not been helpful, but that
might be me using the wrong search terms.

Some examples of what I am talking about are:

 Post fragment containing an "Absolute" cross-link:

 ive made a new thread:
 http://www.aeva.asn.au/forums/forum_posts.asp?TID=316&PID=1958#1958
 

 converts to:

 
 ive made a new thread:
 /viewtopic.php?t=316&p=1958#1958

 Post fragment containing a "Relative" cross-link:

 Battery Management SystemVeroboard prototype

 Needs converting to:

 Battery Management SystemVeroboard prototype

So, my question is:  What is the best way to extract a list of "relative
links" from mixed text/html that I can then walk through to identify the
specific ones I want to convert?

Note, in the beginning of this project, I looked at using "Beautiful
Soup" but my reading and limited testing lead me to believe that it is
designed for well-formed HTML/XML and therefore was unsuitable for the
text/html soup I have.  If that belief is incorrect, I'd be grateful for
general tips about using Beautiful Soup in this scenario...

TIA,



Hi Rob

Is it safe to assume that all the relative (cross) links take one of the 
following forms? :


http://www.aeva.asn.au/forums/forum_posts.asp
www.aeva.asn.au/forums/forum_posts.asp
/forums/forum_posts.asp
/forum_posts.asp (are you really sure about this one?)

If so, and if your goal boils down to converting all instances of old 
style URLs to new style ones regardless of the context where they 
appear, why would a regex fail to meet your needs?
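
To make the suggestion concrete, here is a minimal sketch in Python 3 syntax. It assumes every cross-link carries both a TID and a PID parameter, in that order, and that the ampersand appears as a plain "&" rather than "&amp;"; the names OLD_LINK and convert_links are made up for illustration:

```python
import re

# Optional scheme and host, optional /forums prefix, then the old script name
# with its TID and PID parameters and an optional #anchor.
OLD_LINK = re.compile(
    r'(?:(?:http://)?www\.aeva\.asn\.au)?'
    r'(?:/forums)?/forum_posts\.asp'
    r'\?TID=(\d+)&PID=(\d+)(#\d+)?'
)


def convert_links(text):
    """Rewrite old-style forum links in text to the new-style form."""
    def replacement(match):
        topic, post, anchor = match.groups()
        return '/viewtopic.php?t=%s&p=%s%s' % (topic, post, anchor or '')
    return OLD_LINK.sub(replacement, text)


print(convert_links(
    'ive made a new thread: '
    'http://www.aeva.asn.au/forums/forum_posts.asp?TID=316&PID=1958#1958'
))
```

Links that do not fit these assumptions are left untouched, so they can be reviewed separately.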





Re: static variables

2015-12-01 Thread Grobu
Perhaps you could use a parameter's default value to implement your 
static variable?


Like :
# -
>>> def test(arg=[0]):
...     print arg[0]
...     arg[0] += 1
...
>>> test()
0
>>> test()
1
# -
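
The same idea as a small sketch in Python 3 syntax, next to one possible alternative that stores the state as an attribute on the function itself. The names counter, counter_attr and _state are made up for illustration:

```python
def counter(_state=[0]):
    """Return the number of earlier calls; the default list is created once."""
    current = _state[0]
    _state[0] += 1
    return current


def counter_attr():
    """Same behaviour, with the state kept as a function attribute."""
    current = counter_attr.calls
    counter_attr.calls += 1
    return current


# The attribute has to be initialised once, after the function is defined.
counter_attr.calls = 0
```

Successive calls to either function return 0, then 1, and so on. The default-argument version works because the list is built when the function is defined, not each time it is called.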



Re: filter a list of strings

2015-12-03 Thread Grobu

On 03/12/15 02:15, [email protected] wrote:

I would like to know how this could be done more elegant/pythonic.

I have a big list (over 10.000 items) with strings (each 100 to 300
chars long) and want to filter them.

list = ...

for item in list[:]:
    if 'Banana' in item:
        list.remove(item)
    if 'Car' in item:
        list.remove(item)

There are a lot of more conditions of course. This is just example code.
It doesn't look nice to me. Too much redundancy.

btw: Is it correct to iterate over a copy (list[:]) of that string list
and not the original one?



No idea how 'Pythonic' this would be considered, but you could use a 
combination of filter() with a regular expression :


# --
import re

list = ...

pattern = re.compile( r'banana|car', re.I )
filtered_list = filter( lambda line: not pattern.search(line), list )
# --
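
In Python 3, filter() returns an iterator rather than a list, so a list comprehension may be the simpler spelling there. A minimal sketch, with made-up sample strings and the hypothetical name lines in place of list (which shadows the built-in):

```python
import re

# Hypothetical sample data standing in for the real list of strings.
lines = ['Banana bread', 'Red car', 'Apple pie', 'carpet']

# re.I makes the match case-insensitive; drop it to match 'Banana'/'Car' exactly.
pattern = re.compile(r'banana|car', re.I)

# Keep only the strings that contain none of the unwanted words.
filtered = [line for line in lines if not pattern.search(line)]
print(filtered)
```

Note that a plain substring pattern also rejects 'carpet'; word boundaries (\b) could be added if that is not wanted.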

HTH



Re: How to union nested Sets / A single set from nested sets?

2016-01-06 Thread Grobu

On 04/01/16 03:40, mviljamaa wrote:

I'm forming sets by set.adding to sets and this leads to sets such as:

Set([ImmutableSet(['a', ImmutableSet(['a'])]), ImmutableSet(['b', 'c'])])

Is there a way to union these into a single set, i.e. get

Set(['a', 'b', 'c'])

?


There's a built-in "union" method for sets :

>>> a = set( ['a', 'b'] )
>>> b = set( ['c', 'd'] )
>>> a.union(b)
set(['a', 'c', 'b', 'd'])
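
union() on its own does not unpack sets nested inside sets, so for the nested case something recursive is probably needed. A minimal sketch in Python 3 syntax, using the built-in set and frozenset rather than the old sets module; the function name flatten is made up for illustration:

```python
def flatten(nested):
    """Return a flat set of all non-set members found at any depth."""
    flat = set()
    for item in nested:
        if isinstance(item, (set, frozenset)):
            # Recurse into inner sets and merge their members.
            flat |= flatten(item)
        else:
            flat.add(item)
    return flat


nested = {frozenset(['a', frozenset(['a'])]), frozenset(['b', 'c'])}
print(sorted(flatten(nested)))
```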

HTH


Re: Single format descriptor for list

2016-01-20 Thread Grobu

On 20/01/16 10:35, Paul Appleby wrote:

In BASH, I can have a single format descriptor for a list:

$ a='4 5 6 7'
$ printf "%sth\n" $a
4th
5th
6th
7th

Is this not possible in Python? Using "join" rather than "format" still
doesn't quite do the job:


a = range(4, 8)
print ('th\n'.join(map(str,a)))

4th
5th
6th
7

Is there an elegant way to print-format an arbitrary length list?



In Python 2.7 :

# 
a = '4 5 6 7'
for item in a.split():
    print '%sth' % item
# 

or

# 
a = '4 5 6 7'.split()
print ('{}th\n' * len(a)).format(*a),
# 

or

# 
a = '4 5 6 7'
print ''.join( map( '{}th\n'.format, a.split() ) ),
# 
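
For comparison, a small sketch of one way to do the same in Python 3 syntax: format each item first, then join the results, so the last line gets its suffix too. The names lines and output are made up for illustration:

```python
a = range(4, 8)

# Apply the format to every item, then join with newlines.
lines = ['{}th'.format(item) for item in a]
output = '\n'.join(lines)
print(output)
```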



Re: How to simulate C style integer division?

2016-01-21 Thread Grobu

On 21/01/16 09:39, Shiyao Ma wrote:

Hi,

I wanna simulate C style integer division in Python3.

So far what I've got is:
# a, b = 3, 4

import math
result = float(a) / b
if result > 0:
   result = math.floor(result)
else:
   result = math.ceil(result)


I found it's too laborious. Any quick way?



math.trunc( float(a) / b )



Re: How to simulate C style integer division?

2016-01-23 Thread Grobu

On 22/01/16 04:48, Steven D'Aprano wrote:
[ ... ]


math.trunc( float(a) / b )



That fails for sufficiently big numbers:


py> a = 3**1000 * 2
py> b = 3**1000
py> float(a)/b  # Exact answer should be 2
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: long int too large to convert to float


Note that Python gets the integer division correct:

py> a//b
2L


And even gets true division correct:

py> from __future__ import division
py> a/b
2.0


so it's just the intermediate conversion to float that fails.



Thanks! I did see recommendations to avoid floats throughout the thread, 
but didn't understand why.


The following code should be exempt from such shortcomings :

def intdiv(a, b):
    return (a - (a % (-b if a < 0 else b))) / b




Re: How to simulate C style integer division?

2016-01-23 Thread Grobu

def intdiv(a, b):
 return (a - (a % (-b if a < 0 else b))) / b




Duh ... Got confused with modulos (again).

def intdiv(a, b):
    return (a - (a % (-abs(b) if a < 0 else abs(b)))) / b



Re: How to simulate C style integer division?

2016-01-23 Thread Grobu

On 23/01/16 16:07, Jussi Piitulainen wrote:

Grobu writes:


def intdiv(a, b):
  return (a - (a % (-b if a < 0 else b))) / b




Duh ... Got confused with modulos (again).

def intdiv(a, b):
    return (a - (a % (-abs(b) if a < 0 else abs(b)))) / b


You should use // here to get an exact integer result.



You're totally right, thanks! It isn't an issue for Python 2.x's 
"classic division", but it becomes one with Python 3's "true division", 
and '//' lets you play it safe in both.
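
Putting the thread's corrections together, a sketch of the function with '//' in place of '/'. It uses only integer arithmetic, so it should also cope with the very large operands from the earlier counter-example:

```python
def intdiv(a, b):
    """Divide a by b, truncating toward zero as C does."""
    # Give the modulus the sign of a, so subtracting it moves a toward zero
    # to the nearest multiple of b; the floor division is then exact.
    remainder = a % (-abs(b) if a < 0 else abs(b))
    return (a - remainder) // b


print(intdiv(7, 2), intdiv(-7, 2), intdiv(7, -2), intdiv(-7, -2))
```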



Re: python-2.7.3 vs python-3.2.3

2016-01-26 Thread Grobu

On 26/01/16 13:26, Gene Heskett wrote:

Greetings;

I have need of using a script written for python3, but the default python
on wheezy is 2.7.3.

I see in the wheezy repos that 3.2.3-6 is available.

Can/will they co-exist peacefully?

Thank you.

Cheers, Gene Heskett



On Debian Jessie :

$ ls -F /usr/bin/pyth*
/usr/bin/python@     /usr/bin/python3@     /usr/bin/python3m@
/usr/bin/python2@    /usr/bin/python3.4*
/usr/bin/python2.7*  /usr/bin/python3.4m*

... so it seems they can co-exist peacefully.



Re: show instant data on webpage

2016-01-29 Thread Grobu

On 26/01/16 17:10, mustang wrote:

I've built a sensor to measure some values.
I would like to show it on a web page with python.


This is an extract of the code:

file  = open("myData.dat", "w")

while True:
 temp = sensor.readTempC()
 riga = "%f\n" % temp
 file.write(riga)
 time.sleep(1.0)

file.close()

Up to here, all OK.
Then in PHP I read the file and show it on the internet. It works OK but...
First problem: I have to stop the streaming with Ctrl-C to load the PHP
file, because if I try to read during measurement I cannot show anything
(I think because the file is in use).
How can I show the output on a webpage as it is produced?



Perhaps you could use a Javascript plotting library (like the one at 
flotcharts.org), and use Python only as a CGI agent to serve plotting 
data when requested by Javascript, at regular intervals?
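
A very rough sketch of the Python side of that idea, in Python 3 syntax: a helper that returns the most recent readings as JSON, which a CGI script (or any small web handler) could print for the Javascript to plot. The function name latest_readings and its count parameter are made up for illustration, and the sketch assumes the logging loop calls file.flush() after each write so that new lines actually reach the disk while it is still running:

```python
import json


def latest_readings(path, count=10):
    """Return the last `count` readings in the file as a JSON array string."""
    with open(path) as handle:
        # One reading per line, as written by the logging loop; skip blanks.
        values = [float(line) for line in handle if line.strip()]
    return json.dumps(values[-count:])
```

The Javascript would then request this output at regular intervals and redraw the plot with it.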




Re: [STORY-TIME] THE BDFL AND HIS PYTHON PETTING ZOO

2016-02-03 Thread Grobu

On 03/02/16 04:26, Rick Johnson wrote:
[ ... ]

And many children came from the far and wide, and they
would pet his snake, and they would play with his snake


Didn't know Pedobear had a biographer.


Re: Set Operations on Dicts

2016-02-08 Thread Grobu

You can use dictionary comprehension :

Say :
dict1 = {'a': 123, 'b': 456}
set1 = {'a'}

intersection :
>>> { key:dict1[key] for key in dict1 if key in set1 }
{'a': 123}

difference :
>>> { key:dict1[key] for key in dict1 if key not in set1 }
{'b': 456}



Re: Set Operations on Dicts

2016-02-08 Thread Grobu

On 08/02/16 17:12, Ian Kelly wrote:


dict does already expose set-like views. How about:

{k: d[k] for k in d.keys() & s}  # d & s
{k: d[k] for k in d.keys() - s}  # d - s


Interesting. But seemingly only applies to Python 3.
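
For reference, a short sketch of the keys-view approach as it runs on Python 3. In Python 2.7, dict.viewkeys() gives a comparable set-like view, whereas keys() there returns a plain list. The names d and s follow the quoted example; the sample values are reused from the earlier post:

```python
d = {'a': 123, 'b': 456}
s = {'a'}

# d.keys() is a set-like view in Python 3, so & and - work on it directly.
intersection = {k: d[k] for k in d.keys() & s}
difference = {k: d[k] for k in d.keys() - s}
print(intersection, difference)
```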


Re: What is the right way to import a package?

2015-11-16 Thread Grobu

On 14/11/15 21:00, fl wrote:

Hi,

I want to use a code snippet found on-line. It has such content:

from numpy import *
dt = 0.1
# Initialization of state matrices
X = array([[0.0], [0.0], [0.1], [0.1]])

# Measurement matrices
Y = array([[X[0,0] + abs(randn(1)[0])], [X[1,0] + abs(randn(1)[0])]])



When the above content is inside a .py document and running, there will be an error:

---> 15 Y = array([[X[0,0] + abs(randn(1)[0])], [X[1,0] + abs(randn(1)[0])]])
  16 #Y = ([[X[0,0]], [X[1,0] + 0]])

NameError: name 'randn' is not defined


But when I run the above line by line at the console (Canopy), there will be
no error for the above line.

My question is:

The import and the following are wrong.

X = array([[0.0], [0.0], [0.1], [0.1]])

It should be:

import numpy as np
...
Y = np.array([[X[0,0] + abs(np.randn(1)[0])], [X[1,0] + abs(np.randn(1)[0])]])

This looks like the code I once saw. But the file when running has such
error:

---> 15 Y = np.array([[X[0,0] + abs(np.randn(1)[0])], [X[1,0] + abs(np.randn(1)[0])]])

AttributeError: 'module' object has no attribute 'randn'

When it is run line by line at the console, it has the same error.

It is strange that whether the same content produces an error depends on
whether it is run from a file or at the CLI console.

What am I missing? Thanks,




You can try :
from numpy import *
from numpy.random import *

HTH,

- Grobu -
