I have been working on a program that takes a website and extracts all
the links from the body of the HTML page. I am using clj-tagsoup
<https://github.com/nathell/clj-tagsoup> to create a tree structure from an
HTML page.
The current issue I am running into is traversing the tree structure and
pulling out all the links. I have a function that walks the tree using a
for comprehension and recursion, but it does not feel very idiomatic. The
list it returns is filled with vectors of empty lists and nil values. I can
flatten out the data structure and grab everything I need out of it, but it
feels clunky. I am looking for some tips on how I could improve my code,
since this is the first complicated Clojure program I have written.
Here is the code I have written for extracting the :a tags out of the HTML
tree.
(defn get-tags
  ([tag html]
   (get-tags tag [] html))
  ([tag found html]
   (if html
     (for [el html]
       (if (vector? el)
         (if (= (soup/tag el) tag)
           (conj found el)
           (->> (soup/children el)
                (remove #(or (string? %) (nil? %)))
                (get-tags tag found)
                (conj found))))))))
It gets called with something like this. Normally the tree would be a
lot bigger, but I deleted most of it for this post.
(def html-tree
  [:body
   {}
   [:a
    {:href "conditionedtransiti.php", :shape "rect", :style "display: none;"}
    "triangular-nordic"]
   [:table
    {}
    [:tr
     {}
     [:td
      {:colspan "1", :rowspan "1"}
      [:a
       {:href "/files/", :shape "rect"}
       [:img {:src "truck.gif", :title "Slug's File Archive"}]]]
     [:td
      {:colspan "1", :rowspan "1"}
      [:a {:href "/docs/", :shape "rect"} [:img {:src "magnify.gif"}]]]
     [:td
      {:colspan "1", :rowspan "1"}
      [:a
       {:href
        "http://forecast.weather.gov/MapClick.php?lat=35.045627427000454&lon=-85.30967786199966",
        :shape "rect"}
       "Forecast"]
      [:br {:clear "none"}]
      [:a
       {:href
        "http://radar.weather.gov/radar.php?rid=htx&product=N0R&overlay=11101111&loop=no",
        :shape "rect"}
       "Radar"]
      [:br {:clear "none"}]
      [:a {:href "http://news.google.com/", :shape "rect"} "News"]
      [:br {:clear "none"}]]
     [:td
      {:colspan "1", :rowspan "1"}
      [:a {:href "http://reddit.com", :shape "rect"} "Reddit"]
      [:br {:clear "none"}]
      [:a {:href "http://digg.com", :shape "rect"} "Digg"]
      [:br {:clear "none"}]]]]])

(get-tags :a html-tree)
This evaluates to
(nil
 nil
 [[:a
   {:href "conditionedtransiti.php", :shape "rect", :style "display: none;"}
   "triangular-nordic"]]
 [([([([[:a
         {:href "/files/", :shape "rect"}
         [:img {:src "truck.gif", :title "Slug's File Archive"}]]])]
     [([[:a {:href "/docs/", :shape "rect"} [:img {:src "magnify.gif"}]]])]
     [([[:a
         {:href
          "http://forecast.weather.gov/MapClick.php?lat=35.045627427000454&lon=-85.30967786199966",
          :shape "rect"}
         "Forecast"]]
       [()]
       [[:a
         {:href
          "http://radar.weather.gov/radar.php?rid=htx&product=N0R&overlay=11101111&loop=no",
          :shape "rect"}
         "Radar"]]
       [()]
       [[:a {:href "http://news.google.com/", :shape "rect"} "News"]]
       [()])]
     [([[:a {:href "http://reddit.com", :shape "rect"} "Reddit"]]
       [()]
       [[:a {:href "http://digg.com", :shape "rect"} "Digg"]]
       [()])])])])
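For reference, the flattening workaround I mentioned could look roughly like the sketch below. The names tag-node?, collect-tags and sample-result are made up for illustration and are not part of clj-tagsoup. It assumes every element in the result is a vector whose first item is a keyword, and it uses only clojure.core. Plain flatten would not work here, since it would also break the tag vectors themselves apart.

```clojure
;; Hypothetical helper: an element is assumed to look like [:tag {attrs} & children].
(defn tag-node? [x]
  (and (vector? x) (keyword? (first x))))

;; Hypothetical helper: walk the nested result, treating every sequential
;; value that is NOT a tag vector as a branch, and keep only the tag vectors.
(defn collect-tags [nested]
  (->> nested
       (tree-seq #(and (sequential? %) (not (tag-node? %))) seq)
       (filter tag-node?)))

;; A cut-down stand-in for the kind of value get-tags returns.
(def sample-result
  '(nil
    nil
    [[:a {:href "/docs/"} [:img {:src "magnify.gif"}]]]
    [([[:a {:href "http://digg.com"} "Digg"]] [()])]))

(collect-tags sample-result)
```

This drops the nil values and empty lists and leaves each :a vector intact, but it is still a second pass over a structure that ideally would not be nested in the first place.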
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to [email protected]
Note that posts from new members are moderated - please be patient with your
first post.
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en