If you parse into a data structure compatible with clojure.xml, then you can
use an XML zipper to find the links in the document.

  (-> s
      pl.danieljanus.tagsoup/parse-xml
      clojure.zip/xml-zip
      (clojure.data.zip.xml/xml-> clojure.data.zip/descendants :a))
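
To make the zipper idea concrete, here is a minimal sketch that walks a clojure.xml-style tree using only clojure.zip, which ships with Clojure, instead of data.zip. The sample-doc data and the link-nodes and link-hrefs names are made up for illustration; the only assumption is that the parsed tree uses :tag, :attrs and :content keys the way clojure.xml does.

```clojure
(require '[clojure.zip :as zip])

;; Hypothetical sample data in the clojure.xml shape (:tag, :attrs, :content).
(def sample-doc
  {:tag :body
   :attrs nil
   :content [{:tag :a :attrs {:href "/files/"} :content ["Files"]}
             {:tag :p
              :attrs nil
              :content [{:tag :a :attrs {:href "/docs/"} :content ["Docs"]}]}]})

(defn link-nodes
  "Returns every :a element in root, in depth-first document order."
  [root]
  (->> (zip/xml-zip root)
       (iterate zip/next)                   ; step through every location
       (take-while (complement zip/end?))   ; stop once the walk is finished
       (map zip/node)
       (filter #(= :a (:tag %)))))          ; strings have no :tag, so they drop out

(defn link-hrefs
  "Returns the :href attribute of every :a element in root."
  [root]
  (keep #(get-in % [:attrs :href]) (link-nodes root)))

(link-hrefs sample-doc)
```

Because the walk yields a flat sequence of nodes, there is nothing to flatten afterwards and no nil or empty-list placeholders to filter out.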

- James

On 9 April 2016 at 20:15, Danny Freeman <[email protected]> wrote:

> I have been working on a program that will take a website, and extract all
> the links from the body of the HTML page. I am using tagsoup
> <https://github.com/nathell/clj-tagsoup> to create a tree structure from
> an html page.
>
> The current issue I am running into is traversing the tree structure and
> pulling out all the links. I have a function that parses the tree
> using a for loop and recursion, but it does not feel very idiomatic. The
> list it returns is filled with vectors of empty lists and nil values. I can
> flatten out the data structure and grab everything I need out of it, but it
> feels clunky. I was looking for some tips on how I could improve my code,
> since this is the first complicated Clojure program I have written.
>
> Here is the code I have written for extracting the a tags out of the html
> tree.
>
> (defn get-tags
>   ([tag html]
>     (get-tags tag [] html))
>
>   ([tag found html]
>     (if html
>       (for [el html]
>         (if (vector? el)
>           (if (= (soup/tag el) tag)
>             (conj found el)
>             (->> (soup/children el)
>                    (remove #(or (string? %) (nil? %)))
>                    (get-tags tag found)
>                    (conj found))))))))
>
> It gets called with something like this. Normally the site would be a
> lot bigger, but I deleted a lot of the tree for this post.
>
> (def html-tree
> [:body
>  {}
>  [:a
>   {:href "conditionedtransiti.php", :shape "rect", :style "display: none;"}
>   "triangular-nordic"]
>  [:table
>   {}
>   [:tr
>    {}
>    [:td
>     {:colspan "1", :rowspan "1"}
>     [:a
>      {:href "/files/", :shape "rect"}
>      [:img {:src "truck.gif", :title "Slug's File Archive"}]]]
>    [:td
>     {:colspan "1", :rowspan "1"}
>     [:a {:href "/docs/", :shape "rect"} [:img {:src "magnify.gif"}]]]
>    [:td
>     {:colspan "1", :rowspan "1"}
>     [:a
>      {:href "http://forecast.weather.gov/MapClick.php?lat=35.045627427000454&lon=-85.30967786199966",
>       :shape "rect"}
>      "Forecast"]
>     [:br {:clear "none"}]
>     [:a
>      {:href "http://radar.weather.gov/radar.php?rid=htx&product=N0R&overlay=11101111&loop=no",
>       :shape "rect"}
>      "Radar"]
>     [:br {:clear "none"}]
>     [:a {:href "http://news.google.com/", :shape "rect"} "News"]
>     [:br {:clear "none"}]]
>    [:td
>     {:colspan "1", :rowspan "1"}
>     [:a {:href "http://reddit.com", :shape "rect"} "Reddit"]
>     [:br {:clear "none"}]
>     [:a {:href "http://digg.com", :shape "rect"} "Digg"]
>     [:br {:clear "none"}]]]]])
>
> (get-tags :a html-tree)
>
> This evaluates to
> (nil
>  nil
>  [[:a
>    {:href "conditionedtransiti.php", :shape "rect", :style "display: none;"}
>    "triangular-nordic"]]
>  [([([([[:a
>          {:href "/files/", :shape "rect"}
>          [:img {:src "truck.gif", :title "Slug's File Archive"}]]])]
>      [([[:a {:href "/docs/", :shape "rect"} [:img {:src "magnify.gif"}]]])]
>      [([[:a
>          {:href "http://forecast.weather.gov/MapClick.php?lat=35.045627427000454&lon=-85.30967786199966",
>           :shape "rect"}
>          "Forecast"]]
>        [()]
>        [[:a
>          {:href "http://radar.weather.gov/radar.php?rid=htx&product=N0R&overlay=11101111&loop=no",
>           :shape "rect"}
>          "Radar"]]
>        [()]
>        [[:a {:href "http://news.google.com/", :shape "rect"} "News"]]
>        [()])]
>      [([[:a {:href "http://reddit.com", :shape "rect"} "Reddit"]]
>        [()]
>        [[:a {:href "http://digg.com", :shape "rect"} "Digg"]]
>        [()])])])])
>
> --
> You received this message because you are subscribed to the Google
> Groups "Clojure" group.
> To post to this group, send email to [email protected]
> Note that posts from new members are moderated - please be patient with
> your first post.
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/clojure?hl=en
> ---
> You received this message because you are subscribed to the Google Groups
> "Clojure" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>
