2011/6/6 Base <[email protected]>:
> hi all,
>
> I am working on an app that will parse web pages to do some NLP and
> statistics.  I am able to parse the HTML using several different tools
> (Enlive, HTML Parser, etc.).  However, I would like to discard all the
> rest of the junk in the web page that is not pertinent (i.e. ads).
> Does anyone have any experience doing this?  Any tips on how to do
> this - or even better, tools that you can recommend?  I have been
> digging around on this for a while now and am stuck!
>
> Thanks!
>
> Base

In Enlive there are at least two approaches available:

The first approach is to use the 'select' function to pick out the
interesting parts of the element tree. You use CSS-style selectors to
describe the elements you want.
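
A minimal sketch of the 'select' approach, assuming Enlive is on the
classpath and required under the alias html. The page markup and the
div#content selector are made up for illustration, and html-snippet is
used in place of a real page so the example runs without network
access.

```clojure
(require '[net.cgrand.enlive-html :as html])

;; Hypothetical page: one content block and one ad block.
(def page
  (html/html-snippet
   (str "<div id=\"content\">"
        "<p>First paragraph.</p><p>Second paragraph.</p>"
        "</div>"
        "<div class=\"ad\"><p>Buy now!</p></div>")))

;; Pick out only the paragraphs inside div#content.
(def paragraphs (html/select page [:div#content :p]))

;; html/text returns the text content of a node.
(def paragraph-texts (map html/text paragraphs))
;; => ("First paragraph." "Second paragraph.")
```

The selector is a vector, and each step in it narrows the match to
descendants of the previous step, much like "div#content p" in CSS.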

The second approach is to use the 'at' macro. You give it an element
tree and pairs of selectors and transformations. For each
selector-transformation pair, the transformation is applied to all
elements that match the selector, with the matching nodes deepest in
the tree processed first. A transformation takes a node and returns
what it should be replaced with. You can do almost anything with
them, including removing the element (which might be useful for the
ads in your case) or extracting the text of the node. The result of
the 'at' form is the element tree with all transformations applied.
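
As a rough sketch of the 'at' approach, the following removes every
element matching a selector. It assumes Enlive is required under the
alias html, and the div.ad selector and the markup are hypothetical; on
a real page you would have to work out which selectors match the ads.

```clojure
(require '[net.cgrand.enlive-html :as html])

;; Hypothetical page: one content block and one ad block.
(def page
  (html/html-snippet
   (str "<div id=\"content\"><p>Article text.</p></div>"
        "<div class=\"ad\"><p>Buy now!</p></div>")))

;; A transformation is a function from a node to its replacement.
;; Returning nil replaces the matched element with nothing,
;; which removes it from the tree.
(def without-ads
  (html/at page
    [:div.ad] (constantly nil)))

;; Only the paragraphs outside the removed ad block remain.
(def remaining-texts
  (map html/text (html/select without-ads [:p])))
;; => ("Article text.")
```

Several selector-transformation pairs can be listed in the same 'at'
form if there is more than one kind of junk to strip.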

Both 'select' and 'at' accept an element tree, which you can create
with the html-resource function. It accepts, among other things,
URLs.
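
For example, something along these lines should give you a tree to
work with. The function names fetch-tree and paragraph-texts are my
own, not part of Enlive, and the URL in the usage comment is only a
placeholder.

```clojure
(require '[net.cgrand.enlive-html :as html])

(defn fetch-tree
  "Fetches the page at url-string and parses it into Enlive nodes."
  [url-string]
  (html/html-resource (java.net.URL. url-string)))

(defn paragraph-texts
  "Returns the text of every <p> element in an element tree."
  [tree]
  (map html/text (html/select tree [:p])))

;; Usage (needs network access, so it is left commented out):
;; (paragraph-texts (fetch-tree "http://example.com/"))
```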

You will probably need to write some functions that process HTML
elements yourself, so it's a good idea to get familiar with the data
format of the nodes:

    Element: {:tag :a, :attrs {:href "http://example.com/"},
              :content <sequence of nodes>}
    Text: "text node"
    Comment: {:type :comment, :data "comment node"}
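
Since the nodes are plain maps and strings, a processing function can
be written with nothing but core Clojure. Here is a small sketch that
walks the format above and collects the text; node-text and
example-node are names I made up for the illustration (Enlive's own
'text' function does a similar job).

```clojure
(defn node-text
  "Concatenates the text nodes below node; comments contribute nothing."
  [node]
  (cond
    ;; Text node: a plain string.
    (string? node) node
    ;; Element: recurse into its content.
    (:tag node) (apply str (map node-text (:content node)))
    ;; Comment or anything else.
    :else ""))

(def example-node
  {:tag :p, :attrs nil,
   :content ["Visit "
             {:tag :a, :attrs {:href "http://example.com/"},
              :content ["this site"]}
             {:type :comment, :data "comment node"}
             "."]})

(node-text example-node) ; => "Visit this site."
```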

I found the Enlive wiki very useful. The "Getting Started" page
explains what's there and how to use it very well, I think.
https://github.com/cgrand/enlive/wiki/_pages

I should also mention David Nolen's comprehensive tutorial which
begins with scraping: https://github.com/swannodette/enlive-tutorial

// raek
