
Open Source Web Crawlers Written in Java

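Today my project called for it, so I went looking for a simple, easy-to-use web scraping tool written in Java.

I turned up the following. They look quite good, but none of them is simple. I'm noting them here for now and will take a closer look later.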

Source: http://www.manageability.org/blog/stuff/open-source-web-crawlers-java/view

I was recently quite pleased to learn that the Internet Archive's new crawler is written in Java. Coincidentally, in addition to putting together a list of open source projects for full-text search engines, I had put together a list of crawlers written in Java to complement it. Here's the list:

  1. Heritrix - Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project. Heritrix is designed to respect the robots.txt exclusion directives and META robots tags.
  2. WebSPHINX - WebSPHINX (Website-Specific Processors for HTML INformation eXtraction) is a Java class library and interactive development environment for web crawlers. A web crawler (also called a robot or spider) is a program that browses and processes Web pages automatically. WebSPHINX consists of two parts: the Crawler Workbench and the WebSPHINX class library.
  3. Nutch - Nutch provides a transparent alternative to commercial web search engines. As of June 2003, we have successfully built a 100 million page demo system. It uses Lucene for indexing but provides its own crawler implementation.
  4. WebLech - WebLech is a fully featured web site download/mirror tool in Java, which supports many features required to download websites and emulate standard web-browser behaviour as much as possible. WebLech is multithreaded and will feature a GUI console.
  5. Arale - While many bots around are focused on page indexing, Arale is primarily designed for personal use. It fits the needs of advanced web surfers and web developers.
  6. J-Spider - Based on the book "Programming Spiders, Bots and Aggregators in Java". This book begins by showing how to create simple bots that will retrieve information from a single website. Then a spider is developed that can move from site to site as it crawls across the Web. Next we build aggregators that can take data from many sites and present a consolidated view.
  7. HyperSpider - HyperSpider (Java app) collects the link structure of a website. Data import/export from/to database and CSV-files. Export to Graphviz DOT, Resource Description Framework (RDF/DC), XML Topic Maps (XTM), Prolog, HTML. Visualization as hierarchy and map.
  8. Arachnid - Arachnid is a Java-based web spider framework. It includes a simple HTML parser object that parses an input stream containing HTML content. Simple Web spiders can be created by sub-classing Arachnid and adding a few lines of code called after each page of a Web site is parsed.
  9. Spindle - Spindle is a web indexing/search tool built on top of the Lucene toolkit. It includes an HTTP spider that is used to build the index, and a search class that is used to search the index. In addition, support is provided for the Bitmechanic listlib JSP TagLib, so that a search can be added to a JSP-based site without writing any Java classes.
  10. Spider - Spider is a complete standalone Java application designed to easily integrate varied datasources. It is an XML-driven framework for data retrieval from network-accessible sources, with scheduled pulling. It is highly extensible, provides hooks for custom post-processing and configuration, and is implemented as an Avalon/Keel framework datafeed service.
  11. LARM - LARM is a 100% Java search solution for end-users of the Jakarta Lucene search engine framework. It contains methods for indexing files, database tables, and a crawler for indexing web sites. Well, it will be. At the moment we only have some specifications. It's up to you to turn this into a working program. Its predecessor was an experimental crawler called larm-webcrawler available from the Jakarta project.
  12. Metis - Metis is a tool to collect information from the content of web sites. This was written for the Ideahamster Group for finding the competitive intelligence weight of a web server and assists in satisfying the CI Scouting portion of the Open Source Security Testing Methodology Manual (OSSTMM).
  13. SimpleSpider - The simple spider is a real application that provides the search capability for DevelopMentor's web site. It is also an example application for classroom use in learning about open source programming with Java.
  14. Grunk - Grunk (for GRammar UNderstanding Kernel) is a library for parsing and extracting structured metadata from semi-structured text formats. It is based on a very flexible parsing engine capable of detecting a wide variety of patterns in text formats and extracting information from them. Formats are described in a simple and powerful XML configuration from which Grunk builds a parser at runtime, so adapting Grunk to a new format does not require a coding or compilation step. Not really a crawler, but something that may prove extremely useful in crawling.
  15. CAPEK - CAPEK is an Open Source robot entirely written in Java. It gathers web pages for EGOTHOR in a sophisticated way. The pages are ordered by their pagerank, stability of the connection between Capek and the respective web-site, and many other factors.
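
For anyone who, like me, only needs something small, the core loop that all of the crawlers above share (fetch a page, pull out its links, queue the ones not yet seen) can be sketched with nothing but the JDK. This is only a rough illustration and is not taken from any of the listed projects: the names `MiniCrawler`, `extractLinks`, and `crawl` are made up for this note, the `example.test` pages are invented, and the regex-based link matching is a shortcut that a real crawler would replace with a proper HTML parser. It also ignores robots.txt, which a polite crawler should honor.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration; not the API of any crawler listed above.
class MiniCrawler {
    // Naive matcher for <a ... href="..."> (assumption: quoted href values only).
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']+)[\"']", Pattern.CASE_INSENSITIVE);

    // Returns the absolute URL of every <a href> found in the given HTML.
    static List<String> extractLinks(String baseUrl, String html) {
        List<String> links = new ArrayList<>();
        URI base = URI.create(baseUrl);
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                // Resolve relative links against the page's own URL.
                links.add(base.resolve(m.group(1).trim()).toString());
            } catch (IllegalArgumentException e) {
                // Skip href values that are not valid URIs.
            }
        }
        return links;
    }

    // Breadth-first crawl from seed, visiting at most maxPages URLs.
    // fetcher maps a URL to its HTML, or returns null if the page is unavailable.
    static Set<String> crawl(String seed, int maxPages, Function<String, String> fetcher) {
        Set<String> visited = new LinkedHashSet<>();
        Deque<String> frontier = new ArrayDeque<>();
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue; // already visited
            }
            String html = fetcher.apply(url);
            if (html == null) {
                continue; // nothing to parse
            }
            for (String link : extractLinks(url, html)) {
                if (!visited.contains(link)) {
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // In-memory stand-in for a web site, so the sketch runs without a network.
        Map<String, String> site = Map.of(
                "http://example.test/", "<a href=\"a.html\">A</a> <a href='/b.html'>B</a>",
                "http://example.test/a.html", "<a href=\"/b.html\">B</a> <a href=\"/\">home</a>",
                "http://example.test/b.html", "no links here");
        System.out.println(crawl("http://example.test/", 10, site::get));
    }
}
```

Passing the fetcher in as a function keeps the loop testable offline. For real pages it could be swapped for something that downloads the URL over HTTP, at which point timeouts, politeness delays, and robots.txt handling become the hard part, which is presumably why the tools in the list are not so simple.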

Again, I may have missed something that should be in this list. Please let me know!

- Author: Hibernate, Tuesday, November 1, 2005, 12:41
