OpenSource For You

Html2text


The mammoth size of the World Wide Web, with its explosive growth in both content and variety, poses certain important challenges to end users. Searching for a required piece of information in large volumes of data is one of the key problems that the average Web user faces on a daily basis. We have moved from the era of ‘information scarcity’ to the era of ‘information overload’. Almost any keyword that you fire into a search engine returns anywhere from a few hundred to many thousands, or even a few million, results. Fetching the data required to satisfy the user’s information needs is one of the challenging research issues.

An important solution to this problem of information overload is to extract the contents from Web resources and present them in a form that is comparatively simpler for the user to understand. This article presents Python-based solutions that handle Web content extraction. There is an array of libraries available in Python for content extraction, as illustrated in Figure 1.

The focus of this article is to introduce the following five power-packed Python libraries, each of which has some unique features: html2text, Lassie, Newspaper, Python-Goose and Textract.

The major task performed by the html2text library is the conversion of HTML pages into an easy-to-read plain text format. This library was initiated by Aaron Swartz. The installation of html2text can be done easily with pip, using the command pip install html2text.

After successfully completing the installation process, the html2text library can be used either from the terminal or within a Python program. From the terminal, it can be used, for example, to fetch the text version of the Open Source For You website’s home page. The same result can also be achieved through a Python program.
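The Python usage described above can be sketched roughly as follows. This is a minimal illustration, not the original listing: the helper names html_to_text and url_to_text, the sample HTML and the tag-stripping fallback are all assumptions made for this example. Only html2text.HTML2Text, its ignore_links option and its handle() method come from the library itself.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

try:
    import html2text  # third-party library, assumed installed with pip
except ImportError:
    html2text = None  # the sketch then falls back to a crude stdlib stripper


class _TagStripper(HTMLParser):
    """Stand-in used only when html2text is missing: keeps text nodes only."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def html_to_text(html, ignore_links=False):
    """Convert an HTML string to readable text (hypothetical helper)."""
    if html2text is not None:
        converter = html2text.HTML2Text()
        converter.ignore_links = ignore_links  # True drops the link targets
        return converter.handle(html)
    # Fallback path: no Markdown-style output, just the visible text.
    # ignore_links has no effect here because link targets are never kept.
    stripper = _TagStripper()
    stripper.feed(html)
    stripper.close()
    return " ".join(stripper.chunks)


def url_to_text(url, encoding="utf-8"):
    """Download a page and convert it (hypothetical helper; needs network)."""
    with urlopen(url) as response:
        html = response.read().decode(encoding, errors="replace")
    return html_to_text(html)


sample = "<h1>Title</h1><p>Hello <b>world</b></p>"
print(html_to_text(sample))
```

With html2text installed, the output is Markdown-flavoured text in which headings and emphasis are kept as plain-text markers. Passing a page address to url_to_text would convert a live page in the same way, assuming the page is reachable and encoded as UTF-8.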
