Open Source Crawlers
[http://en.wikipedia.org/wiki/DataparkSearch DataparkSearch] is a crawler and search engine released under the GNU General Public License (GPL).
[[Wget|GNU Wget]] is a command-line operated crawler written in C and released under the GNU General Public License (GPL). It is typically used to mirror web and FTP sites.
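For example, an illustrative invocation that mirrors a site, rewrites links for local browsing and pauses between requests (example.com is a placeholder host) is:

 wget --mirror --convert-links --page-requisites --wait=1 http://example.com/

The --wait option keeps the mirror run polite by rate-limiting requests to the server.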
[http://www.iterating.com/products/JSpider JSpider] is a highly configurable and customizable web spider engine released under the GNU General Public License (GPL).
[http://larbin.sourceforge.net/index-eng.html Larbin] is a web crawler by Sebastien Ailleret.
[http://sourceforge.net/projects/webtools4larbin/ Webtools4larbin] by Andreas Beder is a set of tools for working with Larbin.
[http://bithack.se/methabot/ Methabot] is a speed-optimized web crawler and command-line utility written in C and released under a 2-clause BSD License. It features a flexible configuration system and a module system, and supports targeted crawling through the local filesystem, HTTP or FTP.
[[Nutch]] is a crawler written in Java and released under an Apache License. It can be used in conjunction with the [[Lucene]] text indexing package.
[http://dbpubs.stanford.edu:8091/~testbed/doc2/WebBase/webbase-pages.html#Spider WebVac] is a crawler used by the [http://dbpubs.stanford.edu:8091/~testbed/doc2/WebBase/ Stanford WebBase Project].
[http://www.cs.cmu.edu/~rcm/websphinx/ WebSPHINX] (Miller and Bharat, 1998) is composed of a Java class library that implements multi-threaded web page retrieval and HTML parsing, and a graphical user interface to set the starting URLs, extract the downloaded data and implement a basic text-based search engine.
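WebSPHINX's class library itself is Java; purely as a hedged sketch of the same pattern, multi-threaded page retrieval combined with HTML link extraction can be written in modern Python like this (the class and function names are illustrative, not WebSPHINX's API):

 # Sketch of the pattern WebSPHINX's class library implements in Java:
 # worker threads fetch pages while an HTML parser extracts their links.
 # None of these names belong to WebSPHINX's actual API.
 from concurrent.futures import ThreadPoolExecutor
 from html.parser import HTMLParser
 from urllib.request import urlopen
 
 class LinkExtractor(HTMLParser):
     """Collects the href attribute of every anchor tag on a page."""
     def __init__(self):
         super().__init__()
         self.links = []
     def handle_starttag(self, tag, attrs):
         if tag == "a":
             self.links.extend(v for k, v in attrs if k == "href" and v)
 
 def fetch_and_parse(url):
     """Download one page and return the links found on it."""
     with urlopen(url, timeout=10) as resp:
         html = resp.read().decode("utf-8", errors="replace")
     extractor = LinkExtractor()
     extractor.feed(html)
     return url, extractor.links
 
 start_urls = ["http://example.com/"]  # placeholder starting point
 with ThreadPoolExecutor(max_workers=4) as pool:
     for url, links in pool.map(fetch_and_parse, start_urls):
         print(url, "->", len(links), "links")

Separating retrieval (the thread pool) from parsing (the HTML handler) is the design choice that lets such a library reuse the same downloader for mirroring, data extraction or search indexing.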
[http://www.cwr.cl/projects/WIRE/ WIRE - Web Information Retrieval Environment] (Baeza-Yates and Castillo, 2002) is a web crawler written in C++ and released under the GNU General Public License (GPL). It includes several policies for scheduling page downloads and a module for generating reports and statistics on the downloaded pages, so it has been used for web characterization.
[http://search.cpan.org/~marclang/ParallelUserAgent-2.57/lib/LWP/Parallel/RobotUA.pm LWP::RobotUA] (Langheinrich, 2004) is a [[Perl]] class for implementing well-behaved parallel web robots, distributed under [http://dev.perl.org/licenses/ Perl5's license].
[http://www.noviway.com/Code/Web-Crawler.aspx Web Crawler] is an open source web crawler.
[http://www.ucw.cz/holmes/ Sherlock Holmes] gathers and indexes textual data (text files, web pages, etc.), both locally and over the network. Holmes is sponsored and commercially used by the Czech web portal [http://www.centrum.cz/ Centrum]. It is also used by [[Onet.pl]], where it identifies itself with the user agent string:
 holmes/3.11 (OnetSzukaj/5.0; +http://szukaj.onet.pl)
[http://www.yacy.net/yacy/ YaCy] is a web crawler, indexer and web server with a user interface for the application and the search page, and implements a peer-to-peer protocol to communicate with other YaCy installations. YaCy can be used as a stand-alone crawler/indexer or as a distributed search engine. It is licensed under the GPL.
[http://sourceforge.net/projects/ruya/ Ruya] is an open source, high-performance, breadth-first, level-based web crawler. It is used to crawl English and Japanese websites in a well-behaved manner. It is released under the GNU General Public License (GPL) and is written entirely in Python. A [http://ruya.sourceforge.net/ruya.SingleDomainDelayCrawler-class.html SingleDomainDelayCrawler] implementation obeys robots.txt with a crawl delay.
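As a rough illustration of that behavior (a sketch of the general technique, not Ruya's actual classes), a breadth-first, level-based crawl that honours robots.txt with a fixed delay can be written in present-day Python:

 # Hedged sketch of breadth-first, level-based crawling with a crawl
 # delay and robots.txt checks; it illustrates the technique Ruya
 # documents, not Ruya's own API.
 import re
 import time
 from urllib import robotparser
 from urllib.parse import urljoin, urlparse
 from urllib.request import urlopen
 
 def fetch_links(url):
     """Download a page and return the absolute URLs it links to."""
     with urlopen(url, timeout=10) as resp:
         html = resp.read().decode("utf-8", errors="replace")
     return [urljoin(url, href) for href in re.findall(r'href="([^"]+)"', html)]
 
 def polite_bfs(start_url, max_levels=2, delay=1.0):
     """Visit start_url's site one level at a time, sleeping between
     requests and skipping anything robots.txt disallows."""
     rp = robotparser.RobotFileParser()
     rp.set_url(urljoin(start_url, "/robots.txt"))
     rp.read()
     host = urlparse(start_url).netloc
     seen, level = {start_url}, [start_url]
     for _ in range(max_levels):
         next_level = []
         for url in level:
             if not rp.can_fetch("*", url):
                 continue              # disallowed by robots.txt
             time.sleep(delay)         # the crawl delay
             for link in fetch_links(url):
                 if link not in seen and urlparse(link).netloc == host:
                     seen.add(link)
                     next_level.append(link)
         level = next_level            # advance one BFS level
     return seen

Crawling level by level is what makes the "level-based" scheduling possible: the crawler can bound its depth per site and finish each level completely before going deeper.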
[http://uicrawler.sourceforge.net/ Universal Information Crawler] is a fast-developing web crawler that crawls, saves and analyzes the data it retrieves.
[http://www.agentkernel.com/ Agent Kernel] is a Java framework for schedule, thread and storage management when crawling.
[[Category:FOSS]]