Open Source Crawlers
[http://en.wikipedia.org/wiki/DataparkSearch DataparkSearch] is a crawler and search engine released under the GNU General Public License (GPL).
[[Wget|GNU Wget]] is a command-line operated crawler written in C and released under the GNU General Public License (GPL). It is typically used to mirror web and FTP sites.
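For example, an illustrative invocation that mirrors a site, rewrites links for local browsing and pauses between requests (example.com is a placeholder host) is:

 wget --mirror --convert-links --page-requisites --wait=1 http://example.com/

The --wait option keeps the mirror run polite by rate-limiting requests to the server.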
[http://www.iterating.com/products/JSpider JSpider] is a highly configurable and customizable web spider engine released under the GNU General Public License (GPL).
[http://larbin.sourceforge.net/index-eng.html Larbin] is a web crawler by Sebastien Ailleret.
[http://sourceforge.net/projects/webtools4larbin/ Webtools4larbin] by Andreas Beder is a set of tools for working with Larbin.
[http://bithack.se/methabot/ Methabot] is a speed-optimized web crawler and command-line utility written in C and released under a 2-clause BSD License. It features a flexible configuration system and a module system, and supports targeted crawling through the local filesystem, HTTP or FTP.
[[Nutch]] is a crawler written in Java and released under an Apache License. It can be used in conjunction with the [[Lucene]] text indexing package.
[http://dbpubs.stanford.edu:8091/~testbed/doc2/WebBase/webbase-pages.html#Spider WebVac] is a crawler used by the [http://dbpubs.stanford.edu:8091/~testbed/doc2/WebBase/ Stanford WebBase Project].
[http://www.cs.cmu.edu/~rcm/websphinx/ WebSPHINX] (Miller and Bharat, 1998) is composed of a Java class library that implements multi-threaded web page retrieval and HTML parsing, and a graphical user interface to set the starting URLs, extract the downloaded data and implement a basic text-based search engine.
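WebSPHINX's class library itself is Java; purely as a hedged sketch of the same pattern, multi-threaded page retrieval combined with HTML link extraction can be written in modern Python like this (the class and function names are illustrative, not WebSPHINX's API):

 # Sketch of the pattern WebSPHINX's class library implements in Java:
 # worker threads fetch pages while an HTML parser extracts their links.
 # None of these names belong to WebSPHINX's actual API.
 from concurrent.futures import ThreadPoolExecutor
 from html.parser import HTMLParser
 from urllib.request import urlopen
 
 class LinkExtractor(HTMLParser):
     """Collects the href attribute of every anchor tag on a page."""
     def __init__(self):
         super().__init__()
         self.links = []
     def handle_starttag(self, tag, attrs):
         if tag == "a":
             self.links.extend(v for k, v in attrs if k == "href" and v)
 
 def fetch_and_parse(url):
     """Download one page and return the links found on it."""
     with urlopen(url, timeout=10) as resp:
         html = resp.read().decode("utf-8", errors="replace")
     extractor = LinkExtractor()
     extractor.feed(html)
     return url, extractor.links
 
 start_urls = ["http://example.com/"]  # placeholder starting point
 with ThreadPoolExecutor(max_workers=4) as pool:
     for url, links in pool.map(fetch_and_parse, start_urls):
         print(url, "->", len(links), "links")

Separating retrieval (the thread pool) from parsing (the HTML handler) is the design choice that lets such a library reuse the same downloader for mirroring, data extraction or search indexing.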
[http://www.cwr.cl/projects/WIRE/ WIRE - Web Information Retrieval Environment] (Baeza-Yates and Castillo, 2002) is a web crawler written in C++ and released under the GNU General Public License (GPL). It includes several policies for scheduling page downloads and a module for generating reports and statistics on the downloaded pages, so it has been used for web characterization.
[http://search.cpan.org/~marclang/ParallelUserAgent-2.57/lib/LWP/Parallel/RobotUA.pm LWP::RobotUA] (Langheinrich, 2004) is a [[Perl]] class for implementing well-behaved parallel web robots, distributed under [http://dev.perl.org/licenses/ Perl5's license].
[http://www.noviway.com/Code/Web-Crawler.aspx Web Crawler] is an open source web crawler.
[http://www.ucw.cz/holmes/ Sherlock Holmes] gathers and indexes textual data (text files, web pages, etc.), both locally and over the network. Holmes is sponsored and commercially used by the Czech web portal [http://www.centrum.cz/ Centrum]. It is also used by [[Onet.pl]], where it identifies itself with the user agent string:
 holmes/3.11 (OnetSzukaj/5.0; +http://szukaj.onet.pl)
[http://www.yacy.net/yacy/ YaCy] is a web crawler, indexer and web server with a user interface for the application and the search page, and implements a peer-to-peer protocol to communicate with other YaCy installations. YaCy can be used as a stand-alone crawler/indexer or as a distributed search engine. It is licensed under the GPL.
[http://sourceforge.net/projects/ruya/ Ruya] is an open source, high-performance, breadth-first, level-based web crawler. It is used to crawl English and Japanese websites in a well-behaved manner. It is released under the GNU General Public License (GPL) and is written entirely in Python. A [http://ruya.sourceforge.net/ruya.SingleDomainDelayCrawler-class.html SingleDomainDelayCrawler] implementation obeys robots.txt with a crawl delay.
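As a rough illustration of that behavior (a sketch of the general technique, not Ruya's actual classes), a breadth-first, level-based crawl that honours robots.txt with a fixed delay can be written in present-day Python:

 # Hedged sketch of breadth-first, level-based crawling with a crawl
 # delay and robots.txt checks; it illustrates the technique Ruya
 # documents, not Ruya's own API.
 import re
 import time
 from urllib import robotparser
 from urllib.parse import urljoin, urlparse
 from urllib.request import urlopen
 
 def fetch_links(url):
     """Download a page and return the absolute URLs it links to."""
     with urlopen(url, timeout=10) as resp:
         html = resp.read().decode("utf-8", errors="replace")
     return [urljoin(url, href) for href in re.findall(r'href="([^"]+)"', html)]
 
 def polite_bfs(start_url, max_levels=2, delay=1.0):
     """Visit start_url's site one level at a time, sleeping between
     requests and skipping anything robots.txt disallows."""
     rp = robotparser.RobotFileParser()
     rp.set_url(urljoin(start_url, "/robots.txt"))
     rp.read()
     host = urlparse(start_url).netloc
     seen, level = {start_url}, [start_url]
     for _ in range(max_levels):
         next_level = []
         for url in level:
             if not rp.can_fetch("*", url):
                 continue              # disallowed by robots.txt
             time.sleep(delay)         # the crawl delay
             for link in fetch_links(url):
                 if link not in seen and urlparse(link).netloc == host:
                     seen.add(link)
                     next_level.append(link)
         level = next_level            # advance one BFS level
     return seen

Crawling level by level is what makes the "level-based" scheduling possible: the crawler can bound its depth per site and finish each level completely before going deeper.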
[http://uicrawler.sourceforge.net/ Universal Information Crawler] is a fast-developing web crawler that crawls, saves and analyzes the data it retrieves.
[http://www.agentkernel.com/ Agent Kernel] is a Java framework for schedule, thread and storage management when crawling.
[[Category:FOSS]]