The original paper is in English. Non-English content has been machine-translated and may contain typographical errors or mistranslations. ex. Some numerals are expressed as "XNUMX".
Internet computing is proposed to exploit personal computing resources across the Internet in order to build large-scale Web applications at lower cost. In this paper, a DHT-based distributed Web crawling model based on the concept of Internet computing is proposed. Also, we propose two optimizations to reduce the download time and waiting time of the Web crawling tasks in order to increase the system's throughput and update rate. Based on our contributor-friendly download scheme, the improvement on the download time is achieved by shortening the crawler-crawlee RTTs. In order to accurately estimate the RTTs, a network coordinate system is combined with the underlying DHT. The improvement on the waiting time is achieved by redirecting the incoming crawling tasks to light-loaded crawlers in order to keep the queue on each crawler equally sized. We also propose a simple Web site partition method to split a large Web site into smaller pieces in order to reduce the task granularity. All the methods proposed are evaluated through real Internet tests and simulations showing satisfactory results.
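The two optimizations named in the abstract can be illustrated with a rough sketch. This is not the authors' algorithm. It assumes that each node has a 2-D network coordinate whose Euclidean distance approximates RTT in milliseconds, and that a task is redirected away from any crawler whose queue has reached a threshold. The names `Crawler`, `estimated_rtt`, `pick_crawler`, and `max_queue` are hypothetical and do not come from the paper.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Coord = Tuple[float, float]  # assumed 2-D network coordinate, in ms


@dataclass
class Crawler:
    name: str
    coord: Coord     # hypothetical network coordinate of the crawler
    queue_len: int   # number of crawling tasks currently waiting


def estimated_rtt(a: Coord, b: Coord) -> float:
    """Estimate RTT as the Euclidean distance between two coordinates."""
    return math.dist(a, b)


def pick_crawler(crawlers: List[Crawler], site_coord: Coord, max_queue: int) -> Crawler:
    """Choose a crawler for a task targeting the site at site_coord.

    Among crawlers whose queue is shorter than max_queue, pick the one
    with the lowest estimated RTT to the site. If every crawler is at
    or above the threshold, fall back to the shortest queue.
    """
    light = [c for c in crawlers if c.queue_len < max_queue]
    if light:
        return min(light, key=lambda c: estimated_rtt(c.coord, site_coord))
    return min(crawlers, key=lambda c: c.queue_len)


if __name__ == "__main__":
    pool = [
        Crawler("a", (0.0, 0.0), queue_len=5),
        Crawler("b", (10.0, 0.0), queue_len=1),
        Crawler("c", (100.0, 0.0), queue_len=0),
    ]
    chosen = pick_crawler(pool, site_coord=(1.0, 0.0), max_queue=3)
    print(chosen.name)
```

In this toy pool, crawler "a" is closest to the site but is skipped because its queue exceeds the threshold. This is the trade-off between download time and waiting time that the abstract describes.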
The copyright of the original papers published on this site belongs to IEICE. Unauthorized use of the original or translated papers is prohibited. See IEICE Provisions on Copyright for details.
Xiao XU, Weizhe ZHANG, Hongli ZHANG, Binxing FANG, "Efficient Distributed Web Crawling Utilizing Internet Resources" in IEICE TRANSACTIONS on Information,
vol. E93-D, no. 10, pp. 2747-2762, October 2010, doi: 10.1587/transinf.E93.D.2747.
Abstract: Internet computing is proposed to exploit personal computing resources across the Internet in order to build large-scale Web applications at lower cost. In this paper, a DHT-based distributed Web crawling model based on the concept of Internet computing is proposed. Also, we propose two optimizations to reduce the download time and waiting time of the Web crawling tasks in order to increase the system's throughput and update rate. Based on our contributor-friendly download scheme, the improvement on the download time is achieved by shortening the crawler-crawlee RTTs. In order to accurately estimate the RTTs, a network coordinate system is combined with the underlying DHT. The improvement on the waiting time is achieved by redirecting the incoming crawling tasks to light-loaded crawlers in order to keep the queue on each crawler equally sized. We also propose a simple Web site partition method to split a large Web site into smaller pieces in order to reduce the task granularity. All the methods proposed are evaluated through real Internet tests and simulations showing satisfactory results.
URL: https://global.ieice.org/en_transactions/information/10.1587/transinf.E93.D.2747/_p
@ARTICLE{e93-d_10_2747,
author={Xiao XU and Weizhe ZHANG and Hongli ZHANG and Binxing FANG},
journal={IEICE TRANSACTIONS on Information},
title={Efficient Distributed Web Crawling Utilizing Internet Resources},
year={2010},
volume={E93-D},
number={10},
pages={2747--2762},
abstract={Internet computing is proposed to exploit personal computing resources across the Internet in order to build large-scale Web applications at lower cost. In this paper, a DHT-based distributed Web crawling model based on the concept of Internet computing is proposed. Also, we propose two optimizations to reduce the download time and waiting time of the Web crawling tasks in order to increase the system's throughput and update rate. Based on our contributor-friendly download scheme, the improvement on the download time is achieved by shortening the crawler-crawlee RTTs. In order to accurately estimate the RTTs, a network coordinate system is combined with the underlying DHT. The improvement on the waiting time is achieved by redirecting the incoming crawling tasks to light-loaded crawlers in order to keep the queue on each crawler equally sized. We also propose a simple Web site partition method to split a large Web site into smaller pieces in order to reduce the task granularity. All the methods proposed are evaluated through real Internet tests and simulations showing satisfactory results.},
keywords={},
doi={10.1587/transinf.E93.D.2747},
ISSN={1745-1361},
month={October},}
TY  - JOUR
TI  - Efficient Distributed Web Crawling Utilizing Internet Resources
T2  - IEICE TRANSACTIONS on Information
SP  - 2747
EP  - 2762
AU  - Xiao XU
AU  - Weizhe ZHANG
AU  - Hongli ZHANG
AU  - Binxing FANG
PY  - 2010
DO  - 10.1587/transinf.E93.D.2747
JO  - IEICE TRANSACTIONS on Information
SN  - 1745-1361
VL  - E93-D
IS  - 10
JA  - IEICE TRANSACTIONS on Information
Y1  - 2010/10//
AB  - Internet computing is proposed to exploit personal computing resources across the Internet in order to build large-scale Web applications at lower cost. In this paper, a DHT-based distributed Web crawling model based on the concept of Internet computing is proposed. Also, we propose two optimizations to reduce the download time and waiting time of the Web crawling tasks in order to increase the system's throughput and update rate. Based on our contributor-friendly download scheme, the improvement on the download time is achieved by shortening the crawler-crawlee RTTs. In order to accurately estimate the RTTs, a network coordinate system is combined with the underlying DHT. The improvement on the waiting time is achieved by redirecting the incoming crawling tasks to light-loaded crawlers in order to keep the queue on each crawler equally sized. We also propose a simple Web site partition method to split a large Web site into smaller pieces in order to reduce the task granularity. All the methods proposed are evaluated through real Internet tests and simulations showing satisfactory results.
ER  - 