[this announcement is available online at http://s.apache.org/wpS]

Enterprise-scale Open Source search framework used for everything from crawling
intranets to global Web indexing.

Forest Hill, MD – 10 July 2012 – The Apache Software Foundation (ASF), the
all-volunteer developers, stewards, and incubators of nearly 150 Open Source 
projects and initiatives, today announced Apache Nutch v2.0.

Apache Nutch is a highly scalable search framework written in Java. It builds
on several Apache projects, including Solr™, Tika™, Hadoop™, and Gora™, among
others, to provide crawling, a link-graph database, and parsing support for
HTML and an array of other document formats.

"Having been at the origin of Open Source superstars such as Apache Hadoop or 
Apache Tika, Nutch now catches up with the NoSQL trends and adopts a table-like 
representation," said Apache Nutch Vice President Julien Nioche.

Apache Nutch is lauded for its flexible scalability and extensibility, and is 
the go-to choice for companies of all sizes, from start-ups and medium-sized
businesses to large-scale organizations.

Under development for nearly two years, Nutch v2.0 covers many use cases, from 
small crawls on a single machine to large-scale deployments on Hadoop
clusters. "Importantly, Nutch remains easy to customize thanks to its plugin 
architecture," explained Nioche. Its highly modular architecture allows 
developers to create plug-ins for document parsing, ranking and indexing.

"We use Nutch 2.0 for crawling at web scale because it is flexible, well 
maintained and scales with Hadoop. Crawling the Web in a robust, scalable and 
polite way may seem easy in theory. But in practice, it's not that simple," 
said Mathijs Homminga, CTO of Kalooga. "The Web is a wilderness and taming it 
requires knowledge and expertise on different levels. That's why we initially 
chose Nutch: it runs out of the box and contains the results of many, many, 
many, lessons-learned. It gave us a head start with crawling. But Nutch is not 
just a tool; Nutch is a flexible crawling framework which we can extend and 
modify to our needs."

Nutch v2.0 offers users an edition focused on large-scale crawling that builds 
on storage abstraction (via Apache Gora™) for big data stores such as Apache 
Accumulo™, Apache Avro™, Apache Cassandra™, Apache HBase™, Apache HDFS™ (Hadoop 
Distributed File System), an in-memory data store, and various high-profile SQL
stores.

"Our work on Nutch 2.0 gave birth to Apache Gora in the process, which it uses 
as an abstraction over the storage backends," added Nioche. "This enhanced 
architecture makes Nutch not only more efficient but also easier to integrate 
with external tools while still solving a large range of use cases ranging from 
single-server setups to large-scale Internet crawlers hosted in the cloud."

"2.0 has long been a community effort and something we've been eagerly 
anticipating," said Chris A. Mattmann, Vice President of Apache Tika and Apache 
OODT. "Nutch 2.0's close integration with Tika, and in turn, Tika's integration 
downstream into Apache OODT will undoubtedly bring all of our communities 
closer together, and will assist in the big data challenges that those in our 
projects regularly see. Nutch 2.0 makes full use of the latest features from 
Apache Tika, including its parsing and content detection capabilities."

"The fact that Nutch is implemented on top of Hadoop is essential for us since 
it allows us to be scalable in storage and processing -- have you ever tried to
reparse a billion web pages in a day?" stated Homminga. "Kalooga currently uses 
Nutch 2.0 in production, with the HBase backend, on a 34-node Hadoop cluster. 
Our current collection holds around a billion web pages, growing a few hundred 
million per month. We run indexes on Solr and elasticsearch. Kalooga offers a 
visual relevance service for online publishers and Nutch is an essential part 
of our technology stack."

"Nutch v2.0 is particularly exciting as it catches up with Apache projects like 
HBase, Cassandra, and Accumulo," added Nioche. "The community's response to the 
earlier versions of v2.0 has been very encouraging and we hope to see more and 
more people getting involved."

Availability and Oversight
Apache Nutch software is released under the Apache License v2.0, and is 
overseen by a self-selected team of active contributors to the project. A 
Project Management Committee (PMC) guides the Project's day-to-day operations, 
including community development and product releases. Apache Nutch source code, 
documentation, mailing lists, and related resources are available at 
http://nutch.apache.org/

About The Apache Software Foundation (ASF)
Established in 1999, the all-volunteer Foundation oversees nearly one hundred 
fifty leading Open Source projects, including Apache HTTP Server — the world's 
most popular Web server software. Through the ASF's meritocratic process known 
as "The Apache Way," more than 400 individual Members and 3,500 Committers 
successfully collaborate to develop freely available enterprise-grade software, 
benefiting millions of users worldwide: thousands of software solutions are 
distributed under the Apache License; and the community actively participates 
in ASF mailing lists, mentoring initiatives, and ApacheCon, the Foundation's 
official user conference, trainings, and expo. The ASF is a US 501(c)(3)
not-for-profit charity, funded by individual donations and corporate sponsors 
including AMD, Basis Technology, Citrix, Cloudera, Facebook, GoDaddy, Google, 
IBM, HP, Hortonworks, Huawei, Matt Mullenweg, Microsoft, PSW Group, 
SpringSource, and Yahoo!. For more information, visit http://www.apache.org/.

"Apache", "Nutch", "Apache Nutch", "Accumulo", "Apache Accumulo", "Avro", 
"Apache Avro", "Cassandra", "Apache Cassandra", "Gora", "Apache Gora", 
"Hadoop", "Apache Hadoop", "HBase", "Apache HBase", "HDFS", "Apache HDFS",
"Solr", "Apache Solr", "Tika", "Apache Tika", and "ApacheCon" are trademarks of 
The Apache Software Foundation. All other brands and trademarks are the 
property of their respective owners.

#  #  #
