OpenSource For You

Web Robots: The Worker Bees of the Internet

Web robots, also known as Web crawlers and Web spiders, traverse the Internet to extract various types of information. Web robots can be used for legitimate as well as malicious purposes. Read more about them and make a simple robot of your own by following the steps given in this article.


When we come across the term ‘robot’, most of us think about the hardware robots that are replacing humans in various tasks at an alarming rate. Some of us may even have visions of robots playing badminton, serving food, and so on. According to the software entrepreneur and futurist Martin Ford, it is estimated that around 75 per cent of the work carried out by humans today will be taken over by robots or machines by the end of this century. Many of us have seen all that robots can do in Terminator and other such Hollywood movies. So if robots can carry out all such physical activities, can a robot click on a Web button or a check box, or enter different values in an input box? It certainly can perform all such tasks as well, and we call such robots Internet bots or Web robots.

A Web robot is simply a software application (or a set of programs) that traverses a Web application and performs various automated tasks. Though these robots can perform many kinds of tasks on a Web application, they are most widely used for repetitive tasks, which they complete much faster than a human would. Web robots, also referred to as Web spiders or Web crawlers, are essentially scripts that retrieve information from different Web servers, analyse it and then file it at a speed many times faster than humans could manage. We should not be surprised to learn that more than half (52 per cent, according to Imperva) of all Web traffic is actually generated by Web robots, as most visitors to websites are not humans but bots. This is why they are referred to as the worker bees of the Internet. Some Web bots help in refreshing your Facebook feed continuously, whereas others, like Googlebots, help in ranking different Google search results. Web bots make use of scripts to interact with different Web applications. Hence, we can develop Web robots using different open source scripting or programming languages like Perl, Python, C, PHP, etc.

All of these languages allow Web bot developers to write scripts that perform different procedural tasks such as Web scanning, Web indexing, etc. Web bots can be made to perform many other kinds of tasks with the help of these scripts and, hence, they are categorised based on the kind of actions they perform on a Web application. There are malicious Web robots as well, which are used by hackers to perform diverse tasks on the Internet by installing malicious files.

Web robots make use of instant messaging, Internet Relay Chat or other Web interfaces to communicate with other users of Internet-based services. On Internet Relay Chat, some bots also provide help services for new users by lurking in the background of a conversation channel and commenting on some of the phrases uttered by users (based on pattern matching).
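As a rough illustration of such pattern matching (this is not the code of any particular IRC bot; the patterns and canned replies below are made up purely for the example), a help bot can simply test each incoming message against a few regular expressions and answer when one of them matches:

import re

# hypothetical phrase patterns mapped to canned replies
RESPONSES = [
    (re.compile(r'how do i (register|sign up)', re.I),
     'You can register by messaging the channel operator.'),
    (re.compile(r'\bhelp\b', re.I),
     'Ask your question here and a volunteer will reply shortly.'),
]

def reply_to(message):
    # return the first canned reply whose pattern matches the message, else None
    for pattern, answer in RESPONSES:
        if pattern.search(message):
            return answer
    return None

print(reply_to('Can anyone help me with the installation?'))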

Different types of Web bots and how they are used

Web bots can be used to perform both legitimate and malicious tasks. Depending on the application for which a Web bot is used, it is classified into the following categories.

Legitimate Web robots: These perform legitimate actions on a Web application. They save a lot of time and effort for users; if the same task were performed manually, it would take much longer. These types of Web bots serve a useful purpose and act as time savers for different applications. They are broadly categorised as follows:

1. Spider Web robots: These are used by different search engines to explore Web pages for organisation, content and linking. They have certain criteria for indexing, based on which they determine the ranking of different Web pages within the search results. Such Web robots are used by Google and are called Googlebots.

2. Trading Web robots: These traverse or ‘spider’ different online auction sites in order to locate the best deals on a specific product or service. In such scenarios, a trading Web robot is used for commercial gain. These robots are also helpful in analysing different online commercial applications and providing users with the best deals. Trivago, an online hotel booking Web application, uses such bots to display the best hotel deals.

3. Media Web robots: These are used to provide the latest news updates, weather conditions, sports news, currency exchange rates, etc. They are also used as censors in different applications that run instant messenger programs and chat rooms. Such Web bots are widely used by different online messenger applications like eBuddy, IMO, etc.

Malicious Web robots: These are bots used in different ways to perform malicious acts or network security breaches. Such bots are widely used by hackers all across the globe.

1. Spam Web robots: These collect data from various forms that are filled in online. They are also used for spreading advertisements with the help of pop-ups. Different advertising firms also use them to collect the email addresses of people in order to spam them with advertisements.

2. Hacker Web robots: Hackers use these to crawl the Internet and find vulnerabilities in different websites and online applications, so that they can be exploited for malicious purposes.

3. Botnets: These are networks that hackers set up online using zombie computers, in order to perform various malicious acts such as denial-of-service attacks. Zombie computers are those that have been taken over by a hacker without the owner’s knowledge.

4. Download Web robots: These are used to forcibly download any specific Web page that the hacker wants the surfer to see instead of the one that the surfer requested.

We have all experienced a scenario in which we are left with no option but to click on a link so that a particular Web page gets downloaded.

What should be considered before creating a Web robot

Before developing a Web robot, we need to carefully analyse the task we want it to perform. Here are a few general considerations that need to be looked at before developing a Web robot.

1. Analyse the task or set of tasks that we want the Web robot to perform for us. It is best to use Web bots for repetitive tasks, or anything that takes a lot longer when done manually than when done automatically.

2. The sequence of steps that the Web robot needs to take to complete the task in the minimum amount of time needs to be well planned. This helps in obtaining an efficient, error-free and effective program.

3. Check if the Web bot requires manual intervention or if the task can be fully automated.

4. Properly analyse which scripting or programming language best suits the development of the Web bot, so that it can perform the specific task in the least time and with minimum cost and effort.

5. Check if the Web bot’s requests, edits or any other actions need to be logged.

6. Check if the Web bot needs to run inside a Web browser or just be a standalone program.

7. If the Web bot runs on a remote server, check if it is possible for other users to operate it.

Developing a small Web bot, free of cost

After these considerations have been made, we are good to go ahead with the development of the Web bot. Let’s develop a small one, using a Python script, to scan a specific Web page.

You need to have the following installed on your system:

1. Python 3.x or above

2. Any text-editing application, such as Notepad

Now follow these steps:

Open any plain text editing application, like Notepad (which is included with Microsoft Windows), in which the Python script for the Web robot application can be written.

Create the Python script by including the lines of code given below; the reason for including specific lines of code is mentioned in the ‘comments’.

# importing the required library files
from html.parser import HTMLParser
from urllib.request import urlopen
from urllib import parse

# Creating a class called LinkParser that inherits some methods from HTMLParser
class LinkParser(HTMLParser):

    # adding some more functionality to HTMLParser
    def handle_starttag(self, tag, attrs):
        # checking for the beginning of a link, as links are normally present in an <a> tag
        if tag == 'a':
            for (key, value) in attrs:
                if key == 'href':
                    # grabbing the new URL and adding the base URL to it,
                    # so that it can be added to the collection of links
                    newUrl = parse.urljoin(self.baseUrl, value)
                    self.links = self.links + [newUrl]

    # Function to get links that the spider() function will call
    def getLinks(self, url):
        self.links = []
        self.baseUrl = url
        # Using the urlopen function from the standard Python 3 library
        response = urlopen(url)
        # making sure that we only parse HTML; the header may also carry a
        # charset, so a substring check is safer than an exact comparison
        contentType = response.getheader('Content-Type') or ''
        if 'text/html' in contentType:
            htmlBytes = response.read()
            htmlString = htmlBytes.decode('utf-8')
            self.feed(htmlString)
            return htmlString, self.links
        else:
            return "", []

# spider() function, which takes in a URL, a word to find and the number of pages to search through
def spider(url, word, maxPages):
    pagesToVisit = [url]
    numberVisited = 0
    foundWord = False
    # Creating a LinkParser to get all the links on each page
    while numberVisited < maxPages and pagesToVisit != [] and not foundWord:
        numberVisited = numberVisited + 1
        # starting from the first page of the collection:
        url = pagesToVisit[0]
        pagesToVisit = pagesToVisit[1:]
        try:
            print(numberVisited, "Visiting:", url)
            parser = LinkParser()
            data, links = parser.getLinks(url)
            if data.find(word) > -1:
                foundWord = True
            # adding the links found on this page to the collection still to be visited
            pagesToVisit = pagesToVisit + links
            print("**Success!**")
        except Exception:
            print("**Failed!**")
    if foundWord:
        print("The word", word, "was found at", url)
    else:
        print("Word never found")
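To try out the crawler, save the script to a file and call the spider() function with a starting URL, the word you are looking for, and the maximum number of pages to visit. The URL and the search word below are only placeholder values for illustration; replace them with a page you are permitted to crawl.

# a sample invocation (the URL and the word are placeholders; use your own values)
if __name__ == '__main__':
    spider('https://www.example.com', 'python', 10)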

The many applications of Web robots

1. Chatterbots allow people to ask questions and formulate proper responses to them.

2. Web bots help in handling many tasks such as reporting weather, sports scores, zip-code information, etc.

3. They have been used to develop chat agents like SmarterChild on AOL Instant Messenger, etc.

4. They are widely used for Web page scanning and indexing.

5. They can report real-time vehicle locations for the trucking industry.

6. They can also mimic the tasks performed by a psychiatrist, if loaded with many responses.

7. They report air quality.

8. Gaming bots are used in many gaming applications as well.

9. Auction-site Web robots make the process of auctioning easier.

Advantages and disadvantages of bots

Advantages:

1. Help in performing different actions on Web applications at a much faster rate than when the same task is done manually.

2. Help in retrieving the data required from large sets in the least possible time.

3. Can be quite helpful to site owners in creating additional traffic, if the site is included in a search index.

4. Web robots are generally more reliable and efficient in the actions they perform than humans doing the same task manually.

Disadvantages:

1. There are, at times, temporary denials of service to other visitors reaching a site, caused by heavy bursts of page downloading. Visitors may not be able to access site pages at normal speed or, in extreme cases, at all, since a robot is busy downloading pages in a heavy burst and consuming most of the Web server’s resources.

2. The use of Web robots also results in wasted bandwidth, as some of the content these bots download may not be useful to anyone later; content is often downloaded and then thrown away, which ultimately adds to bandwidth loss.

3. Malicious Web bots are used by hackers to check for vulnerabilities in a Web system, which they then exploit for different malicious activities.

4. Malicious Web robots are also used to spam the emails of users in order to retrieve user information.

Preventing the problems associated with Web bots

We can prevent some of the issues that arise from using Web bots by taking the following precautionary measures.

1. Check if we are actually being visited by Web robots often enough before taking any other measures: We can check if robots are visiting our pages by examining our Web server log files. We can look in the user agent field for unusual names, the frequency of their visits and also for the type of files that are being retrieved.

2. Use meta tags, a robots.txt file and other ways to suggest to robots how your site should or should not be indexed: Try suggesting to the robots how you would like your site to be indexed or downloaded (a small robots.txt and meta tag example is given after this list).

3. Convert time-consuming scripts to static pages, if possible: It takes longer for a Web server to execute scripts that generate Web pages than to serve static HTML pages. Hence, it is less likely that your server would suffer a denial of service when Web robots download static files in heavy bursts.

4. Check if a possible Web robot requested the script before performing any time-consuming functions: If you have to use scripts on your website, check whether a possibly unfriendly Web bot called the script before going into the time-consuming parts of the script.

5. If a particular IP address continuously causes problems in spite of the above steps, consider limiting its access to the server: If you have positive proof that a particular IP address or class of addresses is causing problems for your Web server, try contacting the owners of those Internet connections. If all this fails, you may consider excluding such addresses from your Web server.

6. Use different server-specific components: If you still have severe problems in keeping out unfriendly Web robots, write server-specific components. This means writing an extension to your Web server that dynamically verifies different properties of calling programs to check for possible Web bots (a minimal sketch of such a component is given after this list).
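As an illustration of point 2 above, here is what a simple robots.txt file (placed at the root of the site) and a robots meta tag (placed in the head of an individual page) might look like. The path names are only examples, and note that directives such as Crawl-delay are honoured by some crawlers but ignored by others.

# example robots.txt served from the root of the site
User-agent: *
Disallow: /scripts/
Crawl-delay: 10

<!-- example meta tag for a page that should not be indexed or followed -->
<meta name="robots" content="noindex, nofollow">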
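For point 6, a minimal sketch of such a component, assuming the site runs as a Python WSGI application (the blocked user-agent substrings below are only examples), could look like this:

# minimal WSGI middleware that rejects requests whose User-Agent matches a blocklist
BLOCKED_AGENTS = ('badbot', 'evilscraper')   # example substrings only

def block_bots(app):
    def middleware(environ, start_response):
        agent = environ.get('HTTP_USER_AGENT', '').lower()
        if any(bad in agent for bad in BLOCKED_AGENTS):
            start_response('403 Forbidden', [('Content-Type', 'text/plain')])
            return [b'Forbidden']
        return app(environ, start_response)
    return middleware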

Figure 1: Flow diagram of a traditional Web crawler (Source: googleimages.com)
Figure 2: Architecture of a Web crawler (Source: googleimages.com)
Figure 3: Web scraping done by malicious Web robots (Source: googleimages.com)
Figure 4: Impact of Web bots on website security (Source: googleimages.com)
Figure 5: Breakdown of Web traffic on the basis of visitor type during 2012-2015 (Source: googleimages.com)
Figure 6: Web robot traffic report 2016 (Source: googleimages.com)
Figure 7: Googlebot impostors perform different types of activities (Source: googleimages.com)
