Textract
The Web includes resources other than HTML pages. Textract is a library for extracting text from those resource file formats. In simple terms, it extracts text from many types of file, such as Word documents, PowerPoint presentations, and PDFs. Textract also attempts to extract text from formats such as gif, jpg, mp3, ogg, tiff, and xls, and it relies on various dependencies to handle these file formats. To install it, use the following commands: