Extracting tags from parsed html

Danny Freeman Sat, 09 Apr 2016 12:48:06 -0700

I have been working on a program that takes a web page and extracts all 
the links from the body of the HTML. I am using clj-tagsoup 
<https://github.com/nathell/clj-tagsoup> to create a tree structure from an 
HTML page.


The issue I am running into is traversing the tree structure and 
pulling out all the links. I have a function that parses the tree 
using a for loop and recursion, but it does not feel very idiomatic. The 
list it returns is filled with vectors of empty lists and nil values. I can 
flatten the data structure and grab everything I need out of it, but it 
feels clunky. I am looking for tips on how I could improve my code, 
since this is the first complicated Clojure program I have written. 

Here is the code I have written for extracting the a tags out of the HTML 
tree.

(defn get-tags 
  ([tag html]
    (get-tags tag [] html))

  ([tag found html]
    (if html
      (for [el html]
        (if (vector? el)
          (if (= (soup/tag el) tag)
            (conj found el)
            (->> (soup/children el)
                 (remove #(or (string? %) (nil? %)))
                 (get-tags tag found)
                 (conj found))))))))

It gets called with something like this. Normally the tree would be a 
lot bigger, but I deleted most of it for this post.

(def html-tree 
[:body
 {}
 [:a
  {:href "conditionedtransiti.php", :shape "rect", :style "display: none;"}
  "triangular-nordic"]
 [:table
  {}
  [:tr
   {}
   [:td
    {:colspan "1", :rowspan "1"}
    [:a
     {:href "/files/", :shape "rect"}
     [:img {:src "truck.gif", :title "Slug's File Archive"}]]]
   [:td
    {:colspan "1", :rowspan "1"}
    [:a {:href "/docs/", :shape "rect"} [:img {:src "magnify.gif"}]]]
   [:td
    {:colspan "1", :rowspan "1"}
    [:a
     {:href "http://forecast.weather.gov/MapClick.php?lat=35.045627427000454&lon=-85.30967786199966",
      :shape "rect"}
     "Forecast"]
    [:br {:clear "none"}]
    [:a
     {:href "http://radar.weather.gov/radar.php?rid=htx&product=N0R&overlay=11101111&loop=no",
      :shape "rect"}
     "Radar"]
    [:br {:clear "none"}]
    [:a {:href "http://news.google.com/", :shape "rect"} "News"]
    [:br {:clear "none"}]]
   [:td
    {:colspan "1", :rowspan "1"}
    [:a {:href "http://reddit.com", :shape "rect"} "Reddit"]
    [:br {:clear "none"}]
    [:a {:href "http://digg.com", :shape "rect"} "Digg"]
    [:br {:clear "none"}]]]]])

(get-tags :a html-tree)

This evaluates to 
(nil
 nil
 [[:a
   {:href "conditionedtransiti.php", :shape "rect", :style "display: none;"}
   "triangular-nordic"]]
 [([([([[:a
         {:href "/files/", :shape "rect"}
         [:img {:src "truck.gif", :title "Slug's File Archive"}]]])]
     [([[:a {:href "/docs/", :shape "rect"} [:img {:src "magnify.gif"}]]])]
     [([[:a
         {:href "http://forecast.weather.gov/MapClick.php?lat=35.045627427000454&lon=-85.30967786199966",
          :shape "rect"}
         "Forecast"]]
       [()]
       [[:a
         {:href "http://radar.weather.gov/radar.php?rid=htx&product=N0R&overlay=11101111&loop=no",
          :shape "rect"}
         "Radar"]]
       [()]
       [[:a {:href "http://news.google.com/", :shape "rect"} "News"]]
       [()])]
     [([[:a {:href "http://reddit.com", :shape "rect"} "Reddit"]]
       [()]
       [[:a {:href "http://digg.com", :shape "rect"} "Digg"]]
       [()])])])])
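
For comparison, a flat result without the nested empty lists and nils 
might be reachable with tree-seq from clojure.core. The following is only 
a sketch, not a tested replacement for get-tags. It assumes every node is 
a vector of the shape [tag attrs & children], as in the tree above, and it 
uses first and (drop 2 ...) in place of soup/tag and soup/children so that 
it runs without clj-tagsoup. The names find-tags and sample-tree are 
hypothetical and exist only for this illustration.

```clojure
;; Sketch only. Assumes hiccup-style nodes: [tag attrs & children].
;; find-tags and sample-tree are hypothetical names for illustration.
(defn find-tags
  "Returns a lazy seq of every element in tree whose tag equals tag."
  [tag tree]
  (->> (tree-seq vector? #(drop 2 %) tree)   ; depth-first walk of all nodes
       (filter #(and (vector? %)             ; skip strings and other leaves
                     (= tag (first %))))))   ; keep nodes with a matching tag

;; A small stand-in for the larger tree in this post.
(def sample-tree
  [:body {}
   [:a {:href "/files/"} [:img {:src "truck.gif"}]]
   [:table {}
    [:tr {}
     [:td {} [:a {:href "/docs/"} "Docs"] [:br {:clear "none"}]]]]])

;; Pull the :href out of each matching element's attribute map.
(map #(get-in % [1 :href]) (find-tags :a sample-tree))
;; => ("/files/" "/docs/")
```

One difference from get-tags: this walk also descends into elements that 
already matched, so a matching tag nested inside another matching tag 
would be returned as well.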
