Re: Extracting tags from parsed html

Danny Freeman Sat, 09 Apr 2016 15:06:07 -0700

Thanks for the reply! I like how simple the solution is. I ended up with a 
function looking like this:


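;; Assumes xml-> is referred from clojure.data.zip.xml.
(defn get-tags
  [tag html]
  (let [tags (-> (clojure.zip/xml-zip html)
                 (xml-> clojure.data.zip/descendants tag))]
    ;; Each match is a zipper location; zip/node returns its element.
    (map clojure.zip/node tags)))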

The data structure I get from this makes it really easy to extract all the 
hrefs as well.
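
For the href step, here is a minimal sketch. It assumes the elements are clojure.xml-style maps with :tag, :attrs and :content keys, which is the format James describes below; get-hrefs and sample-tags are names made up for illustration, and the sample data is hand-written rather than real parser output.

```clojure
;; Hypothetical sample: two elements in clojure.xml format, shaped like
;; the values get-tags would return for :a.
(def sample-tags
  [{:tag :a, :attrs {:href "/files/", :shape "rect"}, :content nil}
   {:tag :a, :attrs {:href "/docs/", :shape "rect"}, :content nil}])

(defn get-hrefs
  "Returns the :href attribute of each element, skipping elements without one."
  [tags]
  (keep #(get-in % [:attrs :href]) tags))

(get-hrefs sample-tags)
;; => ("/files/" "/docs/")
```

Using keep instead of map drops any element that has no :href, so the result contains no nil values.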

On Saturday, April 9, 2016 at 4:27:10 PM UTC-4, James Reeves wrote:
>
> If you parse into a data structure compatible with clojure.xml, then you can 
> use an XML zipper to find the links in the document.
>
>   (-> s
>       pl.danieljanus.tagsoup/parse-xml
>       clojure.zip/xml-zip
>       (clojure.data.zip.xml/xml-> clojure.data.zip/descendants :a))
>
> - James
>
> On 9 April 2016 at 20:15, Danny Freeman <[email protected]> wrote:
>
>> I have been working on a program that will take a website, and extract 
>> all the links from the body of the HTML page. I am using tagsoup 
>> <https://github.com/nathell/clj-tagsoup> to create a tree structure from 
>> an html page. 
>>
>> The current issue I am running into is traversing the tree structure and 
>> pulling out all the links. I have a function that parses the tree 
>> using a for loop and recursion, but it does not feel very idiomatic. The 
>> list it returns is filled with vectors of empty lists and nil values. I can 
>> flatten out the data structure and grab everything I need out of it, but it 
>> feels clunky. I was looking for some tips on how I could improve my code, 
>> since this is the first complicated Clojure program I have written. 
>>
>> Here is the code I have written for extracting the a tags out of the html 
>> tree.
>>
>> (defn get-tags 
>>   ([tag html]
>>     (get-tags tag [] html))
>>
>>   ([tag found html]
>>     (if html
>>       (for [el html]
>>         (if (vector? el)
>>           (if (= (soup/tag el) tag)
>>             (conj found el)
>>             (->> (soup/children el)
>>                    (remove #(or (string? %) (nil? %)))
>>                    (get-tags tag found)
>>                    (conj found))))))))
>>
>> It gets called with something like this. Normally the site would be 
>> a lot bigger, but I deleted a lot of the tree for this post.
>>
>> (def html-tree 
>> [:body
>>  {}
>>  [:a
>>   {:href "conditionedtransiti.php", :shape "rect", :style "display: none;"}
>>   "triangular-nordic"]
>>  [:table
>>   {}
>>   [:tr
>>    {}
>>    [:td
>>     {:colspan "1", :rowspan "1"}
>>     [:a
>>      {:href "/files/", :shape "rect"}
>>      [:img {:src "truck.gif", :title "Slug's File Archive"}]]]
>>    [:td
>>     {:colspan "1", :rowspan "1"}
>>     [:a {:href "/docs/", :shape "rect"} [:img {:src "magnify.gif"}]]]
>>    [:td
>>     {:colspan "1", :rowspan "1"}
>>     [:a
>>      {:href "http://forecast.weather.gov/MapClick.php?lat=35.045627427000454&lon=-85.30967786199966",
>>       :shape "rect"}
>>      "Forecast"]
>>     [:br {:clear "none"}]
>>     [:a
>>      {:href "http://radar.weather.gov/radar.php?rid=htx&product=N0R&overlay=11101111&loop=no",
>>       :shape "rect"}
>>      "Radar"]
>>     [:br {:clear "none"}]
>>     [:a {:href "http://news.google.com/", :shape "rect"} "News"]
>>     [:br {:clear "none"}]]
>>    [:td
>>     {:colspan "1", :rowspan "1"}
>>     [:a {:href "http://reddit.com", :shape "rect"} "Reddit"]
>>     [:br {:clear "none"}]
>>     [:a {:href "http://digg.com", :shape "rect"} "Digg"]
>>     [:br {:clear "none"}]]]]])
>>
>> (get-tags :a html-tree)
>>
>> This evaluates to 
>> (nil
>>  nil
>>  [[:a
>>    {:href "conditionedtransiti.php", :shape "rect", :style "display: none;"}
>>    "triangular-nordic"]]
>>  [([([([[:a
>>          {:href "/files/", :shape "rect"}
>>          [:img {:src "truck.gif", :title "Slug's File Archive"}]]])]
>>      [([[:a {:href "/docs/", :shape "rect"} [:img {:src "magnify.gif"}]]])]
>>      [([[:a
>>          {:href "http://forecast.weather.gov/MapClick.php?lat=35.045627427000454&lon=-85.30967786199966",
>>           :shape "rect"}
>>          "Forecast"]]
>>        [()]
>>        [[:a
>>          {:href "http://radar.weather.gov/radar.php?rid=htx&product=N0R&overlay=11101111&loop=no",
>>           :shape "rect"}
>>          "Radar"]]
>>        [()]
>>        [[:a {:href "http://news.google.com/", :shape "rect"} "News"]]
>>        [()])]
>>      [([[:a {:href "http://reddit.com", :shape "rect"} "Reddit"]]
>>        [()]
>>        [[:a {:href "http://digg.com", :shape "rect"} "Digg"]]
>>        [()])])])])
>>
>> -- 
>> You received this message because you are subscribed to the Google
>> Groups "Clojure" group.
>> To post to this group, send email to [email protected]
>> Note that posts from new members are moderated - please be patient with 
>> your first post.
>> To unsubscribe from this group, send email to
>> [email protected]
>> For more options, visit this group at
>> http://groups.google.com/group/clojure?hl=en
>> --- 
>> You received this message because you are subscribed to the Google Groups 
>> "Clojure" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

