Web Crawler

A web crawler, also known as a web spider or bot, is a computer program that searches the World Wide Web in a methodical, orderly fashion. It can be used for various purposes, good or bad. On the good side, it can check whether your website has any broken links. On the bad side, it can harvest valid email addresses from a victim's website (probably for spamming later). Worse still, it can overload the target server, because the downloading happens at machine speed with zero delay between requests.

There are many considerations in building an effective web crawler. I am just going to describe a simple one, so that a novice can understand the concept and start to implement it. The program has to start with a valid URL, which is called the "seed".

The program will use the seed to download a page. For example, if the seed is www.channelnewsasia.com, the program will download the Channel NewsAsia index page. After the download is completed, the program will open the downloaded file and search its contents for other valid URLs. If you are familiar with HTML coding, you know that the best-known marker of a valid URL in a web page is the anchor tag: a link begins with <a href=" and ends with ">. The program will search the file for all the valid URLs and save them into its memory (probably some seed files).
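Below is a minimal sketch of this download-and-extract step in Python, using the seed from the example above. The LinkExtractor class name is my own invention, and real code would need more error handling:

    from html.parser import HTMLParser
    from urllib.request import urlopen

    class LinkExtractor(HTMLParser):
        # Collect the href attribute of every <a> tag encountered.
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    seed = "http://www.channelnewsasia.com"
    page = urlopen(seed).read().decode("utf-8", errors="replace")

    extractor = LinkExtractor()
    extractor.feed(page)
    print(extractor.links)   # every URL found in an <a href="..."> tag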

After the first file has been read, the program will take the next valid URL extracted from the index page, download that page, and repeat the process, searching it and saving any new valid URLs into its memory.

Slowly, the program will collect all the valid URLs reachable from the original seed www.channelnewsasia.com and will be able to map out the Channel NewsAsia website structure.
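A sketch of that crawl loop follows: a queue of URLs still to visit (often called the frontier) plus a set of URLs already seen, so no page is downloaded twice. It reuses the LinkExtractor class from the earlier sketch; the same-domain filter and the max_pages limit are illustrative choices of mine, not part of the description above:

    from collections import deque
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen

    def crawl(seed, max_pages=50):
        frontier = deque([seed])          # URLs waiting to be downloaded
        seen = {seed}                     # URLs already discovered
        domain = urlparse(seed).netloc    # stay inside the seed's site

        while frontier and len(seen) < max_pages:
            url = frontier.popleft()
            try:
                page = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
            except OSError:
                continue                  # skip pages that fail to download

            extractor = LinkExtractor()   # from the earlier sketch
            extractor.feed(page)
            for link in extractor.links:
                absolute = urljoin(url, link)   # resolve relative links
                if urlparse(absolute).netloc == domain and absolute not in seen:
                    seen.add(absolute)
                    frontier.append(absolute)
        return seen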

The web crawler can collect all sorts of URL data, including email addresses, documents, and PDF files, whichever are available on the website. It can also reveal pages that were supposed to be isolated from normal users but are exposed by poor coding technique.
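As a rough illustration, scanning the downloaded page text for email addresses and document links could look like the snippet below; both regular expressions are deliberately simplified sketches and will miss edge cases:

    import re

    EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
    DOC_RE = re.compile(r'href="([^"]+\.(?:pdf|docx?|xlsx?))"', re.IGNORECASE)

    def scan(page_text):
        # Return the email addresses and document links found in one page.
        emails = set(EMAIL_RE.findall(page_text))
        documents = set(DOC_RE.findall(page_text))
        return emails, documents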

The drawback of this web crawler is that it has to download the web pages in order to analyze them for valid URLs or links. You may want to use threading to create more downloaders and increase the speed of the program, as sketched below.
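One way to sketch that idea with Python's standard library: a pool of worker threads pulls URLs from a shared queue, so several pages download in parallel. The worker count and the sentinel scheme are illustrative choices of mine:

    import queue
    import threading
    from urllib.request import urlopen

    def worker(url_queue, results):
        while True:
            url = url_queue.get()
            if url is None:              # sentinel: no more work
                url_queue.task_done()
                break
            try:
                results[url] = urlopen(url, timeout=10).read()
            except OSError:
                results[url] = None      # record the failure and move on
            url_queue.task_done()

    def download_all(urls, num_workers=4):
        url_queue = queue.Queue()
        results = {}
        threads = [threading.Thread(target=worker, args=(url_queue, results))
                   for _ in range(num_workers)]
        for t in threads:
            t.start()
        for url in urls:
            url_queue.put(url)
        for _ in threads:
            url_queue.put(None)          # one sentinel per worker
        url_queue.join()
        for t in threads:
            t.join()
        return results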

There are also rising concerns that the crawlers of web search engines such as Google, Yahoo, and MSN flood websites with automated traffic. A web crawler might likewise be used to mount a DoS (denial-of-service) style attack.


