Html2text
The mammoth size of the World Wide Web, with its explosive growth in both content and variety, poses certain important challenges to end users. Searching for a required piece of information in large volumes of data is one of the key problems that the average Web user faces on a daily basis. We have moved from the era of ‘information scarcity’ to the era of ‘information overload’. Almost any keyword that you enter into a search engine returns results ranging in number from a few hundred to many thousands, and even a few million. Fetching the data required to satisfy the user’s information needs is one of the challenging research issues.
An important solution to this problem of information overload is to extract the contents from Web resources and present them in a form that is simpler for the user to understand. This article presents Python-based solutions for Web content extraction. There is an array of libraries available in Python for content extraction, as illustrated in Figure 1.
The focus of this article is to introduce the following five power-packed Python libraries, each of which has some unique features: html2text, Lassie, Newspaper, Python-Goose and Textract.

The major task performed by the html2text library is the conversion of HTML pages into an easy-to-read plain text format. This library was initiated by Aaron Swartz. html2text can be installed easily with pip, using the command pip install html2text.
After installation, the html2text library can be used either from the terminal or within a Python program. From the terminal, the html2text command can be used, for example, to fetch the text version of the Open Source For You website’s home page. The same result can also be achieved from within a Python program.