Well, not even Google parses those. I'm not sure about Nutch, but some
crawlers (jsoup, I believe) have an option to try to extract full URLs from
plain text, so you can capture some URLs in the form of
someClickFunction('http://www.someurl.com/whatever'), or even when they sit
in the middle of a paragraph. Sometimes it works beautifully; sometimes it
misleads you into parsing URLs that were shortened with an ellipsis in the
middle.
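The plain-text URL capture described above can be approximated with a simple regular expression. Below is a minimal self-contained sketch; the class name and pattern are illustrative, not jsoup's or Nutch's actual implementation, and a naive pattern like this is exactly what produces the ellipsis problem mentioned:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlainTextUrlExtractor {
    // Naive URL pattern: scheme, host, optional path. It stops at quotes,
    // parentheses, and whitespace, so it also finds URLs buried inside
    // JavaScript calls or mid-paragraph prose.
    private static final Pattern URL =
        Pattern.compile("https?://[\\w.-]+(?:/[\\w./?=&%-]*)?");

    public static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL.matcher(text);
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        // Finds the URL even though it sits inside a JavaScript call
        System.out.println(extract("someClickFunction('http://www.someurl.com/whatever')"));
    }
}
```

Note the failure mode from the text above: a URL that was display-shortened with an ellipsis in the middle would be captured only up to the ellipsis, yielding a broken link.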



alexei martchenko
Facebook <http://www.facebook.com/alexeiramone> |
Linkedin <http://br.linkedin.com/in/alexeimartchenko> |
Steam <http://steamcommunity.com/id/alexeiramone/> |
4sq <https://pt.foursquare.com/alexeiramone> | Skype: alexeiramone |
Github <https://github.com/alexeiramone> | (11) 9 7613.0966 |


2014-01-28 rashmi maheshwari <maheshwari.ras...@gmail.com>

> Thanks, all, for the quick response.
>
> Today I crawled a web page using Nutch. This page has many links, but all
> the anchor tags have href="#", and JavaScript is attached to the onClick
> event of each anchor tag to open a new page.
>
> So the crawler didn't crawl any of the links that open via the onClick
> event and have a # href value.
>
> How can these links be crawled using Nutch?
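Nutch's default HTML parser only follows URLs found in href attributes, so href="#" links whose real target lives in an onClick handler are invisible to it. One common workaround is to regex-scan the raw HTML for URLs inside onclick attributes (for example from a custom HtmlParseFilter plugin). A hedged, standalone sketch; the class name and pattern below are illustrative, not part of Nutch:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OnClickLinkExtractor {
    // Matches a single-quoted absolute URL inside an onclick attribute,
    // e.g. <a href="#" onclick="openPage('http://example.com/page1')">
    private static final Pattern ONCLICK_URL = Pattern.compile(
        "onclick\\s*=\\s*\"[^\"]*?'(https?://[^']+)'",
        Pattern.CASE_INSENSITIVE);

    public static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = ONCLICK_URL.matcher(html);
        while (m.find()) {
            links.add(m.group(1)); // the captured URL, without the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"#\" onclick=\"openPage('http://example.com/page1')\">more</a>";
        System.out.println(extract(html));
    }
}
```

The extracted URLs would then be fed back to the crawler as outlinks so they get scheduled; relative URLs inside the handler would additionally need resolving against the page's base URL.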
>
>
>
>
> On Tue, Jan 28, 2014 at 10:54 PM, Alexei Martchenko <
> ale...@martchenko.com.br> wrote:
>
> > 1) Plus, those files are sometimes binaries with metadata; specific
> > parsers are needed to understand them. HTML is plain text.
> >
> > 2) Yes, different data schemas. Sometimes I replicate the same core and
> > run some A/B tests with different weights, filters, etc., and some
> > people like to create CoreA and CoreB with the same schema, hammer
> > CoreA with updates, commits, and optimizes, then make it available for
> > searches while hammering CoreB, and swap again. This produces faster
> > searches.
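The swap step described above maps to Solr's CoreAdmin SWAP action. A sketch, assuming a standalone Solr at localhost:8983 with two cores named CoreA and CoreB as in the example (host, port, and core names are assumptions):

```shell
# After hammering CoreA with updates/commits/optimizes offline,
# atomically swap it with the core currently serving searches:
curl "http://localhost:8983/solr/admin/cores?action=SWAP&core=CoreA&other=CoreB"
```

After the swap the names exchange, so searches immediately hit the freshly built index while the other core becomes the new indexing target.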
> >
> >
> >
> >
> > 2014-01-28 Jack Krupansky <j...@basetechnology.com>
> >
> > > 1. Nutch follows the links within HTML web pages to crawl the full
> > > graph of a web of pages.
> > >
> > > 2. Think of a core as an SQL table - each table/core has a different
> > > type of data.
> > >
> > > 3. SolrCloud is all about scaling and availability - multiple shards
> > > for larger collections and multiple replicas for both scaling of
> > > query response and availability if nodes go down.
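The shards-and-replicas point can be made concrete with Solr's Collections API. A sketch, assuming a running SolrCloud cluster at localhost:8983 (host, port, and collection name are assumptions):

```shell
# Create a collection split over 2 shards (for size), each with
# 2 replicas (for query throughput and node-failure tolerance):
curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=2&replicationFactor=2"
```

With replicationFactor=2, every shard lives on two nodes, so queries keep working if one node goes down.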
> > >
> > > -- Jack Krupansky
> > >
> > > -----Original Message----- From: rashmi maheshwari
> > > Sent: Tuesday, January 28, 2014 11:36 AM
> > > To: solr-user@lucene.apache.org
> > > Subject: Solr & Nutch
> > >
> > >
> > > Hi,
> > >
> > > Question 1: When Solr can parse HTML and documents like Word, Excel,
> > > PDF, etc., why do we need Nutch to parse HTML files? What is
> > > different?
> > >
> > > Question 2: When do we use multiple cores in Solr? Is there any
> > > practical business case where we need multiple cores?
> > >
> > > Question 3: When do we go for cloud? What is the meaning of
> > > implementing SolrCloud?
> > >
> > >
> > > --
> > > Rashmi
> > > Be the change that you want to see in this world!
> > > www.minnal.zor.org
> > > disha.resolve.at
> > > www.artofliving.org
> > >
> >
>
>
>
>
