Re: From JoyceUlysses.txt -- words occurring exactly once
On 2024-05-30 19:26:37 -0700, HenHanna via Python-list wrote:
> hard to decide what to do with hyphens
> and apostrophes
> (I'd, he's, can't, haven't, A's and B's)

Especially since the same character is used as both an apostrophe and a
closing quotation mark. And while that's pretty unambiguous between two
characters, it isn't at the end of a word:

This is Alex’ house.
This type of building is called an ‘Alex’ house.
The sentence ‘We are meeting at Alex’ house’ contains an apostrophe.

(using proper Unicode quotation marks. It gets worse if you stick to ASCII.)

Personally I like to use U+0027 APOSTROPHE as an apostrophe and U+2018
LEFT SINGLE QUOTATION MARK and U+2019 RIGHT SINGLE QUOTATION MARK as
single quotation marks[1], but despite the suggestive names, this is not
the common typographical convention, so your texts are unlikely to make
this distinction.

hp

[1] Which I use rarely, anyway.

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | [email protected] |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
--
https://mail.python.org/mailman/listinfo/python-list
Re: Lprint = ( Lisp-style printing ( of lists and strings (etc.) ) in Python )
On 2024-05-30 21:47:14 -0700, HenHanna via Python-list wrote:
> [('the', 36225), ('and', 17551), ('of', 16759), ('i', 16696), ('a', 15816),
> ('to', 15722), ('that', 11252), ('in', 10743), ('it', 10687)]
>
> ((the 36225) (and 17551) (of 16759) (i 16696) (a 15816) (to 15722) (that
> 11252) (in 10743) (it 10687))
>
>
> i think the latter is easier-to-read, so i use this code
>(by Peter Norvig)
This doesn't work well if your strings contain spaces:
Lprint(
    [
        ["Just", "three", "words"],
        ["Just", "three words"],
        ["Just three", "words"],
        ["Just three words"],
    ]
)
prints:
((Just three words) (Just three words) (Just three words) (Just three words))
Output is often a compromise between readability and precision.
> def lispstr(exp):
>     # "Convert a Python object back into a Lisp-readable string."
>     if isinstance(exp, list):
This won't work for your example, since you have a list of tuples, not a
list of lists and a tuple is not an instance of a list.
>         return '(' + ' '.join(map(lispstr, exp)) + ')'
>     else:
>         return str(exp)
>
> def Lprint(x): print(lispstr(x))
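A minimal sketch of a variant that addresses both points raised above: it accepts tuples as well as lists, and it quotes strings so that embedded spaces stay visible. The quoting style (double quotes with backslash escapes) is an assumption chosen for illustration, not something from the quoted code.

```python
def lispstr(exp):
    """Render nested lists/tuples Lisp-style, quoting strings."""
    # Accept tuples too, so a list of (word, count) pairs works.
    if isinstance(exp, (list, tuple)):
        return "(" + " ".join(map(lispstr, exp)) + ")"
    if isinstance(exp, str):
        # Assumed quoting convention: "..." with backslash escapes.
        escaped = exp.replace("\\", "\\\\").replace('"', '\\"')
        return '"' + escaped + '"'
    return str(exp)


def Lprint(x):
    print(lispstr(x))


pairs = lispstr([("the", 36225), ("and", 17551)])
spaced = lispstr([["Just", "three words"], ["Just three words"]])
```

With this version the four "Just three words" lists from the example print differently, at the cost of some extra quote characters in the output.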
I like to use pprint, but it lacks support for user-defined types. I
should be able to add a method (maybe __pprint__?) to my classes which
handles proper formatting (with line breaks and indentation).
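One possible workaround today, sketched under assumptions: subclass pprint.PrettyPrinter and override its documented format() method so that it defers to a per-class hook. The name __pprint__ is hypothetical (pprint defines no such protocol), and HookedPrettyPrinter and Point are illustrative names only.

```python
import pprint


class HookedPrettyPrinter(pprint.PrettyPrinter):
    """PrettyPrinter that defers to a hypothetical __pprint__() method."""

    def format(self, obj, context, maxlevels, level):
        hook = getattr(type(obj), "__pprint__", None)
        if hook is not None:
            # format() must return (text, is_readable, is_recursive).
            return hook(obj), False, False
        return super().format(obj, context, maxlevels, level)


class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

    def __pprint__(self):
        # The class decides its own line breaks and indentation.
        return f"Point(\n    x={self.x!r},\n    y={self.y!r},\n)"


text = HookedPrettyPrinter().pformat(Point(1, 2))
print(text)
```

Only the object passed directly to pformat() is exercised here. The hook has no way of knowing the current indentation level, so objects nested inside containers would need more work.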
hp
Re: From JoyceUlysses.txt -- words occurring exactly once
On 6/1/2024 4:04 AM, Peter J. Holzer via Python-list wrote:
> On 2024-05-30 19:26:37 -0700, HenHanna via Python-list wrote:
>> hard to decide what to do with hyphens
>> and apostrophes
>> (I'd, he's, can't, haven't, A's and B's)
>
> Especially since the same character is used as both an apostrophe and a
> closing quotation mark. And while that's pretty unambiguous between two
> characters, it isn't at the end of a word:
>
> This is Alex’ house.
> This type of building is called an ‘Alex’ house.
> The sentence ‘We are meeting at Alex’ house’ contains an apostrophe.
>
> (using proper Unicode quotation marks. It gets worse if you stick to ASCII.)
>
> Personally I like to use U+0027 APOSTROPHE as an apostrophe and U+2018
> LEFT SINGLE QUOTATION MARK and U+2019 RIGHT SINGLE QUOTATION MARK as
> single quotation marks[1], but despite the suggestive names, this is not
> the common typographical convention, so your texts are unlikely to make
> this distinction.
>
> hp
>
> [1] Which I use rarely, anyway.

My usual approach is to replace punctuation with spaces and then discard
anything remaining that is only one character long (or sometimes two,
depending on what I'm working on). Yes, OK, I will miss words like "I".
Usually I don't care about them. Make exceptions to the policy if you
like.
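The approach described above could be sketched roughly as follows. The set of characters treated as punctuation (ASCII punctuation plus a few curly quotes and dashes), the function name words, and the min_len and keep parameters are all assumptions made for illustration.

```python
import string

# Assumption: "punctuation" = ASCII punctuation plus curly quotes and dashes.
PUNCT = string.punctuation + "\u2018\u2019\u201c\u201d\u2013\u2014"
TO_SPACE = str.maketrans(PUNCT, " " * len(PUNCT))


def words(text, min_len=2, keep=frozenset()):
    """Replace punctuation with spaces, split, and drop very short tokens.

    Tokens listed in `keep` survive regardless of length (the
    "exceptions to the policy").
    """
    tokens = text.lower().translate(TO_SPACE).split()
    return [t for t in tokens if len(t) >= min_len or t in keep]


plain = words("I can't say; he's here.")
with_i = words("I can't say; he's here.", keep={"i"})
```

Here plain drops "i" along with the stray "t" and "s" left over from the contractions, while with_i keeps "i" as an explicit exception.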
Re: From JoyceUlysses.txt -- words occurring exactly once
On 5/31/24 11:59, Dieter Maurer via Python-list wrote:
hmmm, I "sent" this but there was some problem and it remained unsent.
Just in case it hasn't All Been Said Already, here's the retry:
> HenHanna wrote at 2024-5-30 13:03 -0700:
>> Given a text file of a novel (JoyceUlysses.txt) ...
>> could someone give me a pretty fast (and simple) Python program that'd
>> give me a list of all words occurring exactly once?
>
> Your task can be split into several subtasks:
> * parse the text into words
> This depends on your notion of "word".
> In the simplest case, a word is any maximal sequence of non-whitespace
> characters. In this case, you can use `split` for this task.
This piece is by far "the hard part", because of the ambiguity. For
example, if I just say non-whitespace, then a word followed by
punctuation counts as a distinct word. What about hyphenation, of which
there are both the compound word forms and the ones at the end of lines
if the source text has been formatted that way? Are all-lowercase words
different than the same word starting with a capital? What about
non-initial capitals, as happens a fair bit in modern usage with
acronyms, trademarks (perhaps not in Ulysses? :-) ), etc.? What about
accented letters?
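To make the ambiguity concrete, here is a small sketch comparing a plain whitespace split with one simple normalisation (lowercasing and stripping leading and trailing ASCII punctuation). The sample sentence and the normalisation rule are assumptions chosen for illustration; neither settles the hyphenation question.

```python
import string
from collections import Counter

sample = "The house. the house, a well-known HOUSE"

# Plain whitespace split: punctuation and case make every token distinct.
naive = Counter(sample.split())

# One possible normalisation: lowercase, strip surrounding punctuation.
stripped = (tok.strip(string.punctuation).lower() for tok in sample.split())
normalized = Counter(tok for tok in stripped if tok)

print(naive)
print(normalized)
```

The naive count sees seven different "words", each occurring once, whereas the normalised count merges the variants of "house" and "the" but still treats "well-known" as a single word.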
If you want what's at least a quick starting point to play with, you
could use a very simple regex - a fair amount of thought has gone into
what a "word character" is (\w), so it deals with excluding both
punctuation and whitespace.
import re
from collections import Counter

# Lowercase the whole text, then count every maximal run of \w characters.
with open("JoyceUlysses.txt", "r") as f:
    wordcount = Counter(re.findall(r'\w+', f.read().lower()))
Now you have a Counter object counting all the "words" with their
occurrence counts (by this definition) in the document. You can fish
through that to answer the questions asked (find entries with a count of
1, 2, 3, etc.)
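As a sketch of that last step, the following pulls out the words with a given count. The helper name words_occurring and the short sample string are hypothetical; with the real file you would pass in the text read from JoyceUlysses.txt instead.

```python
import re
from collections import Counter


def words_occurring(text, n=1):
    """Return the sorted words that occur exactly n times in text."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    return sorted(word for word, count in counts.items() if count == n)


sample = "Stately, plump Buck Mulligan came from the stairhead. The plump one."
once = words_occurring(sample)
twice = words_occurring(sample, n=2)
print(once)
print(twice)
```

Note that \w also matches digits and the underscore, and that it splits at apostrophes and hyphens, so "can't" is counted as "can" and "t" under this definition.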
Some people Go Big and use something that actually tries to recognize
the language, as opposed to making assumptions from ranges of
characters. nltk is a choice there. But at this point it's not really
"simple" any longer (though nltk experts might end up disagreeing with
that).
