Re: Newbie trying HTML parsing

Mike Wed, 14 Oct 2015 17:28:19 -0700

So now I'm trying to make the conversion to Crouton.  Of course that is not 
going well.  Here is a chunk of code:


(ns one.core
  (:gen-class)
  (:require [clj-http.client :as client]
            [clojure.zip :as z]
            [clojure.data.zip :as dz]
            [clojure.data.zip.xml :as dzx]
            [crouton.html :as html]))

(defn get-post-data [url]
  (client/get url))

(def response (get-post-data login-URL))

(html/parse (:body response))

The *response* value is correct (its body is the HTML), but when I try to 
execute *html/parse* I get:

java.io.FileNotFoundException: 


<html>
<head id="Head1"><title>
 User Login Page
<\title>
    <style>
        body
        {
            color: #000000;
            font: 12px\1.4 arial,FreeSans,Helvetica,sans-serif;
            margin: 0;


... TONS OF HTML DELETED ...


    <\center>    
    <\form>
<\body>
<\html>
 (The filename or extension is too long)
         (Unknown Source) java.io.FileInputStream.open0
 FileInputStream.java:195 java.io.FileInputStream.open
 FileInputStream.java:138 java.io.FileInputStream.<init>
 
... LOTS OF STACK TRACE DELETED ...

I hope someone can help.  TIA.
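
A possible reading of the trace, offered as a sketch rather than a confirmed 
diagnosis: the FileInputStream frames suggest that html/parse hands its 
argument to clojure.java.io/input-stream, which treats a String as a URL or 
file path rather than as content. The snippet below uses only 
clojure.java.io; the body value is a made-up stand-in for (:body response).

```clojure
(require '[clojure.java.io :as io])

;; Hypothetical stand-in for (:body response); not the real login page.
(def body
  "<html><head><title>User Login Page</title></head><body></body></html>")

;; A String is interpreted as a URL or a file path, so markup cannot be opened.
(def string-result
  (try
    (with-open [in (io/input-stream body)]
      (slurp in))
    (catch java.io.IOException _
      :not-a-file)))

;; A byte array is read as content rather than as a location.
(def bytes-result
  (with-open [in (io/input-stream (.getBytes ^String body "UTF-8"))]
    (slurp in)))
```

If that reading is right, passing bytes (or a stream) to html/parse instead 
of the String should avoid the error. Crouton may also provide a 
string-specific function (parse-string, if memory serves); verify that 
against the version in use.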


On Wednesday, October 14, 2015 at 6:03:29 PM UTC-5, James Reeves wrote:
>
> I'm not that familiar with Enlive, so I can't comment on the ease of that 
> approach.
>
> However, the way I'd personally do it is that I'd make use of Crouton and 
> the zipper functions in clojure.zip and clojure.data.zip. A zipper is a 
> functional way of navigating an immutable data structure.
>
> So first add Crouton and data.zip to your dependencies:
>
>   [[crouton "0.1.2"]
>    [org.clojure/data.zip "0.1.1"]]
>
> Then use Crouton to parse the body of the response from clj-http:
>
>   (crouton.html/parse (:body response))
>
> This will give you a data structure that's compatible with clojure.xml, 
> and therefore compatible with the XML zipper functions:
>
>   (dzx/xml1-> (z/xml-zip parsed-html)
>               dz/descendants
>               (dzx/tag= :input)
>               (dzx/attr= :name "foo"))
>
> In the above case I'm using the following namespace aliases:
>
>   (require '[clojure.zip :as z]
>            '[clojure.data.zip :as dz]
>            '[clojure.data.zip.xml :as dzx])
>
> It's been a while since I've needed to traverse X/HTML in Clojure though, 
> so my code might be a little off.
>
> - James
>
> On 14 October 2015 at 22:53, Mike <[email protected]> wrote:
>
>> Thanks James!  You helped me get another step along the way, I got this 
>> working.
>>
>> Of course you mentioned Crouton; you should, and I did ask for advice on 
>> my approach.  So please allow me to expand the problem statement and you 
>> may advise me further...
>>
>> Once I get this HTML parsed, I know that somewhere buried in this page is 
>> an *<input>* tag that has a *name="name"* attribute, where I will specify 
>> the name value at run time.  I will need to be able to programmatically 
>> find this tag and pull some values out of it.  Will using *clj-tagsoup* 
>> or *Crouton* make this location operation easier?  Perhaps even using 
>> *Enlive* might make it easier, since the location and path to the tag are 
>> not known; it must be located.
>>
>> On Wednesday, October 14, 2015 at 1:53:11 PM UTC-5, James Reeves wrote: 
>>>
>>>
>>> Crouton is an alternative HTML parsing library (that's coincidentally 
>>> written by me) and can be found at: 
>>> https://github.com/weavejester/crouton
>>>
>>> Crouton uses a slightly different output syntax, which is compatible 
>>> with Clojure's xml zipper functions, making it more suitable for document 
>>> searches and traversal (IMO).
>>>
>>> - James
>>>
>>>
>
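
The search discussed in the thread (find an input tag by its name attribute 
when its position in the page is unknown) can be sketched with clojure.zip 
alone. This is an illustration under assumptions, not the data.zip query 
quoted above: the tree is hand-built in the clojure.xml shape, with keyword 
tags and attribute keys assumed to match the parser's output, and find-input 
and the sample field names are hypothetical.

```clojure
(require '[clojure.zip :as z])

;; Hand-built tree in the clojure.xml shape: maps with :tag, :attrs, :content.
;; It stands in for parsed HTML; the field names and values are invented.
(def parsed-html
  {:tag :html :attrs nil
   :content
   [{:tag :body :attrs nil
     :content
     [{:tag :form :attrs {:action "/login"}
       :content
       [{:tag :input :attrs {:name "user" :value "mike"} :content nil}
        {:tag :input :attrs {:name "token" :value "abc123"} :content nil}]}]}]})

(defn find-input
  "Returns the first :input node whose :name attribute equals field-name,
  or nil when no such node exists. Walks the whole tree depth-first."
  [tree field-name]
  (->> (z/xml-zip tree)
       (iterate z/next)
       (take-while (complement z/end?))
       (map z/node)
       (filter #(and (map? %)
                     (= :input (:tag %))
                     (= field-name (get-in % [:attrs :name]))))
       first))

;; Pull an attribute out of the located node.
(def token-value
  (get-in (find-input parsed-html "token") [:attrs :value]))
```

Because the walk visits every node, the caller does not need to know the 
path to the tag; the name can be supplied at run time as field-name.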
