On Saturday, May 26, 2018 at 3:54:37 AM UTC+5:30, Cameron Simpson wrote:
> On 25May2018 04:23, Subhabrata Banerjee wrote:
> >On Friday, May 25, 2018 at 3:59:57 AM UTC+5:30, Cameron Simpson wrote:
> >> On 24May2018 03:13, wrote:
> >> >I have a text as,
> >> >
> >> >"Hawaii volcano generates toxic gas plume called laze PAHOA: The eruption
> >> >of Kilauea volcano in Hawaii sparked new safety warnings about toxic gas
> >> >on the Big Island's southern coastline after lava began flowing into the
> >> >ocean and setting off a chemical reaction. Lava haze is made of dense
> >> >white clouds of steam, toxic gas and tiny shards of volcanic glass. Janet
> >> >Babb, a geologist with the Hawaiian Volcano Observatory, says the plume
> >> >"looks innocuous, but it's not." "Just like if you drop a glass on your
> >> >kitchen floor, there's some large pieces and there are some very, very
> >> >tiny pieces," Babb said. "These little tiny pieces are the ones that can
> >> >get wafted up in that steam plume." Scientists call the glass Limu O
> >> >Pele, or Pele's seaweed, named after the Hawaiian goddess of volcano and
> >> >fire"
> >> >
> >> >and I want to see its tagged output as,
> >> >
> >> >"Hawaii/TAG volcano generates toxic gas plume called laze PAHOA/TAG: The
> >> >eruption of Kilauea/TAG volcano/TAG in Hawaii/TAG sparked new safety
> >> >warnings about toxic gas on the Big Island's southern coastline after
> >> >lava began flowing into the ocean and setting off a chemical reaction.
> >> >Lava haze is made of dense white clouds of steam, toxic gas and tiny
> >> >shards of volcanic glass. Janet/TAG Babb/TAG, a geologist with the
> >> >Hawaiian/TAG Volcano/TAG Observatory/TAG, says the plume "looks
> >> >innocuous, but it's not." "Just like if you drop a glass on your kitchen
> >> >floor, there's some large pieces and there are some very, very tiny
> >> >pieces," Babb/TAG said. "These little tiny pieces are the ones that can
> >> >get wafted up in that steam plume." Scientists call the glass Limu/TAG
> >> >O/TAG Pele/TAG, or Pele's seaweed, named after the Hawaiian goddess of
> >> >volcano and fire"
> >> >
> >> >To do this I generally try to take a list at the back end as,
> >> >
> >> >Hawaii
> >> >PAHOA
> [...]
> >> >and do a simple code as follows,
> >> >
> >> >def tag_text():
> >> >    corpus=open("/python27/volcanotxt.txt","r").read().split()
> >> >    wordlist=open("/python27/taglist.txt","r").read().split()
> [...]
> >> >    list1=[]
> >> >    for word in corpus:
> >> >        if word in wordlist:
> >> >            word_new=word+"/TAG"
> >> >            list1.append(word_new)
> >> >        else:
> >> >            list1.append(word)
> >> >    tagged_text=" ".join(list1)
> >> >    print tagged_text
> >> >
> >> >get the results, and then hand-repair unwanted tags like Hawaiian/TAG
> >> >goddess of volcano/TAG.
> >> >I am looking for a better coding approach so that I need not spend
> >> >time on hand repairing.
> >>
> >> It isn't entirely clear to me why these two taggings are unwanted.
> >> Intuitively, they seem to be either because "Hawaiian goddess" is a
> >> compound term where you don't want "Hawaiian" to get a tag, or because
> >> "Hawaiian" has already received a tag earlier in the list. Or are there
> >> other criteria?
> >>
> >> If you want to solve this problem with a programme you must first clearly
> >> define what makes an unwanted tag "unwanted". [...]
> >
> >By unwanted I did not mean anything so intricate.
> >Unwanted meant things I did not want.
>
> That much was clear, but you need to specify in your own mind _precisely_
> what makes some things unwanted and others wanted. Without concrete
> criteria you can't write code to implement those criteria.
>
> I'm not saying "you need to imagine code to match these things": you're
> clearly
> capable of doing that. I'm saying you need to have well defined concepts of
> what makes something unwanted (or, if that is easier to define, wanted). You
> can do that iteratively: start with your basic concept and see how well it
> works. When those concepts don't give you the outcome you desire, consider a
> specific example which isn't working and try to figure out what additional
> criterion would let you distinguish it from a working example.
>
> >For example,
> >if my target phrases included terms like,
> >government of Mexico,
> >
> >now in my list I would have words with their tags as,
> >government
> >of
> >Mexico
> >
> >If I put these words in the list, it would tag
> >government/TAG of/TAG Mexico/TAG,
> >
> >but it would also tag every other "of", which may appear
> >anywhere, as in "haze is made of/TAG dense white",
> >"clouds of/TAG steam", etc.
> >
> >Cleaning up these unwanted tags becomes a daunting task
> >for me.
>
> Richard Damon has pointed out that you seem to want phrases instead of just
> words.
>
> >I have been experimenting with
> >wordlist=[("Kilauea volcano", "Kilauea/TAG volcano/TAG"),
> >          ("Hawaii", "Hawaii/TAG"), ...]
> >tag=reduce(lambda a, kv: a.replace(*kv), wordlist, corpus)
> >
> >is giving me a sizeably good result, but the size of the wordlist is a
> >slight concern.
>
> You can reduce that list by generating the "wordlist" form from something
> smaller:
>
> base_phrases = ["Kilauea volcano", "government of Mexico", "Hawaii"]
> wordlist = [
>     (base_phrase, " ".join([word + "/TAG" for word in base_phrase.split()]))
>     for base_phrase in base_phrases
> ]
>
> You could even autosplit the longer phrases so that your base_phrases
> _automatically_ becomes:
>
> base_phrases = ["Kilauea volcano", "Kilauea", "volcano",
>                 "government of Mexico", "government", "Mexico", "Hawaii"]
>
> That way your "replace" call would find the longer phrases before the shorter
> phrases and thus _not_ tag the single words if they occurred in a longer
> phrase, while still tagging the single words when they _didn't_ land in a
> longer phrase.
>
> Also, it is unclear to me whether "/TAG" is a fixed string or intended to
> be distinct, such as "/PROPER_NOUN", "/LOCATION", etc. If they vary then
> you need a more elaborate setup.
>
> It sounds like you want a more general purpose parser, and that depends upon
> your purposes. If you're coding to learn the basics of breaking up text, what
> you're doing is fine and I'd stick with it. But if you're just after the
> outcome (tags), you could use other libraries to break up the text.
>
> For example, the Natural Language ToolKit (NLTK) will do structured
> parsing of text and return you a syntax tree, and it has many other
> facilities. Doco:
>
> http://www.nltk.org/
>
> PyPI module:
>
> https://pypi.org/project/nltk/
>
> which you can install with the command:
>
> pip install --user nltk
>
> That would get you a tree structure of the corpus, which you could
> process more meaningfully. For example, you could traverse the tree and
> tag higher level nodes as you came across them, possibly then _not_
> traversing their inner nodes. The effect of that would be that if you
> hit the grammatical node:
>
> government of Mexico
>
> you might tag that node with "ORGANISATION" and choose not to descend
> inside it, thus avoiding tagging "government" and "of" and so forth,
> because you have a higher-level tag. Nodes not specially recognised
> you would keep descending into, tagging smaller things.
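[Editorial note: that traversal can be sketched without NLTK, using a toy (label, children) tree; the node labels and the recognised mapping below are illustrative, not NLTK's actual types.]

```python
def leaves(node):
    """Collect the leaf words of a (label, children) tree in order."""
    _label, children = node
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words

def tag_tree(node, recognised):
    """Walk the tree; when a node's label is in the recognised mapping,
    tag all of its words with the node's tag and do not descend further.
    Unrecognised nodes are descended into, leaving bare words untagged."""
    label, children = node
    if label in recognised:
        return " ".join(w + "/" + recognised[label] for w in leaves(node))
    parts = []
    for child in children:
        if isinstance(child, str):
            parts.append(child)
        else:
            parts.append(tag_tree(child, recognised))
    return " ".join(parts)
```

With tree = ("S", ["the", ("NP", ["government", "of", "Mexico"]), "said"]) and recognised = {"NP": "ORGANISATION"}, the NP node is tagged as a whole and its inner "of" is never visited on its own.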
>
> Cheers,
> Cameron Simpson
Dear Sir,
Thank you for your kind and valuable suggestions, and for your kind time.
I know NLTK and machine learning. I am of the belief that if we use
language properly, we need machine learning the least.
So I am trying to design a tagger without the help of machine learning,
with simple Python coding. I have therefore set aside the standard Parts
of Speech (PoS) and Named Entity (NE) tagging schemes.
I am trying to design a basic model which, if required, may be applied
to any one of these problems.
Detecting longer phrases is slightly a problem now; I am thinking of
employing re.search(pattern, text). If this part is done, I do not need
machine learning. Maintaining so much data is a cumbersome issue in
machine learning.
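One way to make the phrase detection work in a single pass is re.sub with an alternation ordered longest-first, so whole phrases win over their component words (a sketch; tag_phrases is an illustrative name, not code from this thread):

```python
import re

def tag_phrases(text, phrases):
    """Tag every occurrence of the listed phrases in one pass.
    Sorting longest-first makes "government of Mexico" match before a
    bare "of" could; word-boundary anchors stop matches inside words."""
    ordered = sorted(phrases, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(p) for p in ordered) + r")\b"
    def add_tags(match):
        return " ".join(w + "/TAG" for w in match.group(0).split())
    return re.sub(pattern, add_tags, text)

# tag_phrases("clouds of steam near the government of Mexico",
#             ["government of Mexico"])
# -> "clouds of steam near the government/TAG of/TAG Mexico/TAG"
```

Because only full phrases go into the pattern, a stray "of" elsewhere in the corpus is never tagged.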
My regards to all the other esteemed coders and members of the group for
their kind and valuable time and suggestions.
--
https://mail.python.org/mailman/listinfo/python-list