Big Datasets
The key to machine learning is lots of data. Getting hold of big data and being able to process it used to be a challenge, but today's commonplace fat connections and gargantuan hard drives make this much easier. Besides the MNIST digits, there are a number of other useful datasets available for free. The 14GB Wikipedia text dump makes a fine corpus for natural language processing projects.
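If you want to get a feel for working with the Wikipedia dump, a minimal sketch along the following lines streams articles straight out of the compressed file. It assumes the gensim package is installed and that the English dump has been downloaded under the illustrative filename enwiki-latest-pages-articles.xml.bz2:

```python
# Sketch: stream plain-text articles from a Wikipedia XML dump with gensim.
# The filename below is illustrative; use whichever dump you downloaded.
from gensim.corpora import WikiCorpus

# Passing dictionary={} skips the slow vocabulary-building pass.
wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2', dictionary={})

# get_texts() yields each article as a list of lower-cased tokens,
# decompressing and parsing lazily, so the 14GB file never has to
# fit in memory at once.
for i, tokens in enumerate(wiki.get_texts()):
    print(' '.join(tokens[:20]), '...')   # first few words of each article
    if i >= 2:                            # stop after a handful for the demo
        break
```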
The DBpedia project goes even further, providing a Java-powered framework for extracting more structured information from Wikipedia and related projects. They provide a pre-extracted knowledge base comprising 4.58 million things, most of which are classified using a carefully crafted and consistent ontology. This makes it easy to process relationships between things, or to restrict to a particular category of interest, be it people, places or plants.
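You don't even have to download the knowledge base to experiment with it: DBpedia also exposes a public SPARQL endpoint. The sketch below, assuming the SPARQLWrapper package is installed, pulls a small slice restricted to one category of interest (people, in this case):

```python
# Sketch: query DBpedia's public SPARQL endpoint for a handful of people
# and their birth dates, using the dbo: (DBpedia ontology) namespace.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper('https://dbpedia.org/sparql')
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?person ?birthDate WHERE {
        ?person a dbo:Person ;
                dbo:birthDate ?birthDate .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results['results']['bindings']:
    print(row['person']['value'], row['birthDate']['value'])
```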
IMDb provides a number of datasets (www.imdb.com/interfaces) for non-commercial use, including a large collection of images of film stars. Images of the 100,000 most popular actors, sourced from both IMDb and Wikipedia, were used by a team from ETH Zurich to create the largest public database of labelled face images in the world. This was used to train their DEX (Deep EXpectation) system, which estimates age and attractiveness, and proved an internet hit. Learn more and download the data at https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki.
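The IMDB-WIKI download ships its labels as a MATLAB .mat file alongside the face crops. A minimal sketch of reading it in Python might look like the following; the field names (dob, photo_taken, full_path) follow the metadata layout described on the project page, so adjust them if your copy of the archive differs:

```python
# Sketch: read the IMDB-WIKI metadata and recover each subject's age
# at the time the photo was taken. Assumes the imdb.mat metadata file
# from the project download sits in the current directory.
from datetime import date
from scipy.io import loadmat

meta = loadmat('imdb.mat')['imdb'][0, 0]
dob = meta['dob'][0]                  # dates of birth, MATLAB serial dates
photo_taken = meta['photo_taken'][0]  # year each photo was taken
paths = meta['full_path'][0]          # relative paths to the face crops

def datenum_to_year(dn):
    # MATLAB serial dates are offset from Python ordinals by 366 days.
    return date.fromordinal(int(dn) - 366).year

for i in range(3):  # peek at the first few records
    age = int(photo_taken[i]) - datenum_to_year(dob[i])
    print(paths[i][0], 'approx. age', age)
```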