Thanks for the reply! I like how simple the solution is. I ended up with a
function looking like this:
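;; Needs clojure.zip, clojure.data.zip and clojure.data.zip.xml loaded.
(defn get-tags
  [tag html]
  (let [tags (-> (clojure.zip/xml-zip html)
                 (clojure.data.zip.xml/xml-> clojure.data.zip/descendants tag))]
    ;; each match is a zipper location; keep only the node it points at
    (map clojure.zip/node tags)))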
The data structure I get from this makes it really easy to pull out all the
hrefs as well.
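
A minimal sketch of that last step, assuming the nodes have the clojure.xml
shape ({:tag ... :attrs ... :content ...}). The names sample-nodes and hrefs
are made up for this example and the data is invented:

```clojure
;; Hypothetical sample: nodes shaped like clojure.xml elements,
;; i.e. the kind of values get-tags is assumed to return.
(def sample-nodes
  [{:tag :a :attrs {:href "/files/" :shape "rect"} :content nil}
   {:tag :a :attrs {:href "http://reddit.com" :shape "rect"} :content ["Reddit"]}
   {:tag :a :attrs {:shape "rect"} :content ["no link"]}])

(defn hrefs
  "Returns the :href attribute of every node that has one."
  [nodes]
  ;; keep drops the nil produced by nodes without an :href
  (keep #(get-in % [:attrs :href]) nodes))

(hrefs sample-nodes)
;; => ("/files/" "http://reddit.com")
```

Using keep rather than map means anchors without an href simply disappear
from the result instead of showing up as nil.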
On Saturday, April 9, 2016 at 4:27:10 PM UTC-4, James Reeves wrote:
>
> If you parse into a data structure compatible with clojure.xml, then you can
> use an XML zipper to find the links in the document.
>
> (-> s
>     pl.danieljanus.tagsoup/parse-xml
>     clojure.zip/xml-zip
>     (clojure.data.zip.xml/xml-> clojure.data.zip/descendants :a))
>
> - James
>
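
For anyone reading along without data.zip on the classpath, roughly the same
descendant walk can be sketched with clojure.zip alone. This is only an
illustration, not James's code: tags-via-zipper and sample-doc are made-up
names, and the sample assumes the clojure.xml node shape.

```clojure
(require '[clojure.zip :as zip])

(defn tags-via-zipper
  "Returns every node under `root` (inclusive) whose :tag equals `tag`.
  Assumes `root` is a clojure.xml-style map with :tag, :attrs and :content."
  [tag root]
  (->> (zip/xml-zip root)
       (iterate zip/next)                   ; depth-first walk over all locations
       (take-while (complement zip/end?))   ; stop once the walk is finished
       (map zip/node)
       (filter #(= tag (:tag %)))))         ; text nodes have no :tag, so they drop out

;; Hypothetical sample document
(def sample-doc
  {:tag :body :attrs {}
   :content [{:tag :a :attrs {:href "/docs/"} :content ["Docs"]}
             {:tag :p :attrs {}
              :content ["See "
                        {:tag :a :attrs {:href "http://digg.com"} :content ["Digg"]}]}]})

(map #(get-in % [:attrs :href]) (tags-via-zipper :a sample-doc))
;; => ("/docs/" "http://digg.com")
```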
> On 9 April 2016 at 20:15, Danny Freeman <[email protected]> wrote:
>
>> I have been working on a program that will take a website, and extract
>> all the links from the body of the HTML page. I am using tagsoup
>> <https://github.com/nathell/clj-tagsoup> to create a tree structure from
>> an html page.
>>
>> The current issue I am running into is traversing the tree structure and
>> pulling out all the links. I have a function that parses the tree
>> using a for loop and recursion, but it does not feel very idiomatic. The
>> list it returns is filled with vectors of empty lists and nil values. I can
>> flatten out the data structure and grab everything I need out of it, but it
>> feels clunky. I was looking for some tips on how I could improve my code,
>> since this is the first complicated Clojure program I have written.
>>
>> Here is the code I have written for extracting the a tags out of the html
>> tree.
>>
>> (defn get-tags
>>   ([tag html]
>>    (get-tags tag [] html))
>>
>>   ([tag found html]
>>    (if html
>>      (for [el html]
>>        (if (vector? el)
>>          (if (= (soup/tag el) tag)
>>            (conj found el)
>>            (->> (soup/children el)
>>                 (remove #(or (string? %) (nil? %)))
>>                 (get-tags tag found)
>>                 (conj found))))))))
>>
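
One way to get a flat result without the zipper, sketched directly on the
vector format shown in this post. It assumes every element is a vector of the
form [tag attrs & children]; tags-via-tree-seq and sample-tree are made-up
names, and plain first/drop stand in for the soup/tag and soup/children
helpers.

```clojure
(defn tags-via-tree-seq
  "Returns a flat seq of every element in `html` whose tag is `tag`.
  Assumes each element is a vector [tag attrs & children]."
  [tag html]
  (->> html
       (tree-seq vector? #(drop 2 %))   ; children follow the tag and attribute map
       (filter #(and (vector? %) (= tag (first %))))))

;; Hypothetical, cut-down tree in the same shape as html-tree
(def sample-tree
  [:body {}
   [:a {:href "/files/"} [:img {:src "truck.gif"}]]
   [:table {}
    [:tr {}
     [:td {} [:a {:href "http://reddit.com"} "Reddit"] [:br {}]]]]])

(map #(get-in % [1 :href]) (tags-via-tree-seq :a sample-tree))
;; => ("/files/" "http://reddit.com")
```

tree-seq flattens the whole tree up front, so there are no nested vectors,
empty lists or nils to clean up afterwards.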
>> It gets called with something like this. Normally the site would be
>> a lot bigger, but I deleted a lot of the tree for this post.
>>
>> (def html-tree
>>   [:body
>>    {}
>>    [:a
>>     {:href "conditionedtransiti.php", :shape "rect", :style "display: none;"}
>>     "triangular-nordic"]
>>    [:table
>>     {}
>>     [:tr
>>      {}
>>      [:td
>>       {:colspan "1", :rowspan "1"}
>>       [:a
>>        {:href "/files/", :shape "rect"}
>>        [:img {:src "truck.gif", :title "Slug's File Archive"}]]]
>>      [:td
>>       {:colspan "1", :rowspan "1"}
>>       [:a {:href "/docs/", :shape "rect"} [:img {:src "magnify.gif"}]]]
>>      [:td
>>       {:colspan "1", :rowspan "1"}
>>       [:a
>>        {:href "
>> http://forecast.weather.gov/MapClick.php?lat=35.045627427000454&lon=-85.30967786199966
>> ",
>>         :shape "rect"}
>>        "Forecast"]
>>       [:br {:clear "none"}]
>>       [:a
>>        {:href "
>> http://radar.weather.gov/radar.php?rid=htx&product=N0R&overlay=11101111&loop=no
>> ",
>>         :shape "rect"}
>>        "Radar"]
>>       [:br {:clear "none"}]
>>       [:a {:href "http://news.google.com/", :shape "rect"} "News"]
>>       [:br {:clear "none"}]]
>>      [:td
>>       {:colspan "1", :rowspan "1"}
>>       [:a {:href "http://reddit.com", :shape "rect"} "Reddit"]
>>       [:br {:clear "none"}]
>>       [:a {:href "http://digg.com", :shape "rect"} "Digg"]
>>       [:br {:clear "none"}]]]]])
>>
>> (get-tags :a html-tree)
>>
>> This evaluates to
>> (nil
>> nil
>> [[:a
>> {:href "conditionedtransiti.php", :shape "rect", :style "display: none;"}
>> "triangular-nordic"]]
>> [([([([[:a
>> {:href "/files/", :shape "rect"}
>> [:img {:src "truck.gif", :title "Slug's File Archive"}]]])]
>> [([[:a {:href "/docs/", :shape "rect"} [:img {:src "magnify.gif"}]]])]
>> [([[:a
>> {:href "
>> http://forecast.weather.gov/MapClick.php?lat=35.045627427000454&lon=-85.30967786199966
>> ",
>> :shape "rect"}
>> "Forecast"]]
>> [()]
>> [[:a
>> {:href "
>> http://radar.weather.gov/radar.php?rid=htx&product=N0R&overlay=11101111&loop=no
>> ",
>> :shape "rect"}
>> "Radar"]]
>> [()]
>> [[:a {:href "http://news.google.com/", :shape "rect"} "News"]]
>> [()])]
>> [([[:a {:href "http://reddit.com", :shape "rect"} "Reddit"]]
>> [()]
>> [[:a {:href "http://digg.com", :shape "rect"} "Digg"]]
>> [()])])])])
>>
>> --
>> You received this message because you are subscribed to the Google
>> Groups "Clojure" group.
>> To post to this group, send email to [email protected]
>> Note that posts from new members are moderated - please be patient with
>> your first post.
>> To unsubscribe from this group, send email to
>> [email protected]
>> For more options, visit this group at
>> http://groups.google.com/group/clojure?hl=en
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "Clojure" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>