Sunday, February 1, 2009

Chapter 1: Introduction to the topic background

It will surprise no one that the Internet was created to share
information and to communicate with one another.
It is hard to evaluate how big the Internet is: estimates differ widely between
companies, ranging from 15 to some 30 billion Web pages [1]. The number of
websites increases every day and is estimated at 185,167,897 [2], having grown
steadily since the creation of the World Wide Web.

Illustration 1: Total Sites Across All Domains August 1995 - January 2009

Habits have changed since the creation of the Internet, and websites are now used in
diverse ways: while having one has become a standard for companies (recognized as a
mark of trust, seriousness and quality), the Web is also a space for many individuals
(the blog phenomenon). In France, for example, in June 2008 14% of people over the
age of 12, which corresponds to 22% of French Internet users, were authors of a blog
or a website [3].
The Internet becoming commonplace, together with the fact that anyone can create
their own website for free, reinforces the impression the Internet gives: a true
jungle of information, and sometimes even a real “dump” where the accuracy of
information is concerned.
Websites can be accessed through three channels:
· Direct access (for example, you know the website address by heart, you have
saved it in your favorites, or you find it on a business card and type it into
the address bar);
· External links (you visit a website that links to another website; this is
the case for most websites, catalogs and advertisements);
· Search engines (you use a dedicated application, typing in some keywords to
get suggestions for what you are looking for).
As this list shows, relying only on the first two ways to navigate the web
is too rigid and does not reach widely enough. It has also been said that the
first way is increasingly being displaced by search engines [4].
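As a rough illustration of why going from link to link is not wide enough on its own, the sketch below follows links breadth-first over a tiny made-up link graph. The page names (`a.example` and so on), the `LINKS` table and the `reachable` function are all assumptions invented for this example, not real data or a real tool.

```python
from collections import deque

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example"],
    "c.example": [],
    "d.example": ["a.example"],  # links out, but no page links to it
}


def reachable(start_pages, links):
    """Return the set of pages reachable from start_pages by following links."""
    seen = set(start_pages)
    queue = deque(start_pages)
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


# Start from the one address we already know (direct access),
# then rely only on external links.
found = reachable(["a.example"], LINKS)
missed = set(LINKS) - found

print(sorted(found))   # ['a.example', 'b.example', 'c.example']
print(sorted(missed))  # ['d.example']
```

In this toy graph `d.example` exists but is never found, because no known page links to it. Reaching such a page requires either already knowing its address or asking a search engine.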
So one could say that there are currently two main ways to navigate the web: from
link to link and by using a search engine, the latter being indispensable in order
to navigate the web properly.
More and more information is put on the Internet, which turns it
into a true jungle. The only way to navigate that information properly
is to use search engines.

