OpenSource For You

Textract


The Web includes resources other than HTML pages. Textract is a library for extracting text from those other file formats, such as Word documents, PowerPoint presentations and PDFs. It also attempts to extract text from gif, jpg, mp3, ogg, tiff and xls files, among others, and relies on various dependencies to handle these formats. To install it, use the following commands:
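
Textract is published on PyPI, so `pip install textract` is the usual starting point; the system packages needed for individual formats vary by platform and are not listed here. Once it is installed, a minimal usage sketch might look like the following. The helper names `is_mentioned_format` and `extract_text` are hypothetical and used only for illustration, and the extension set covers just the formats named above, not the library's full list.

```python
from pathlib import Path

# Only the extensions named in the text above; textract's real list is longer.
MENTIONED_EXTENSIONS = {"gif", "jpg", "mp3", "ogg", "tiff", "xls"}


def is_mentioned_format(path):
    """Return True if the file's extension is one of those listed above."""
    return Path(path).suffix.lower().lstrip(".") in MENTIONED_EXTENSIONS


def extract_text(path):
    """Extract text from a file with textract and return it as a str."""
    # Imported lazily so the helper above works without textract installed.
    import textract

    raw = textract.process(str(path))  # textract.process returns bytes
    return raw.decode("utf-8", errors="replace")
```

Calling `extract_text("report.xls")` (a hypothetical file name) would hand the file to textract, which picks a parser based on the extension; whether a given format works depends on which of its dependencies are present on the system.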
