Big Datasets

The key to machine learning is lots of data. Getting hold of big data and being able to process it used to be a challenge, but today's commonplace fat connections and gargantuan hard drives make this much easier. Besides the MNIST numerals there are a number of other useful datasets available for free. The 14GB Wikipedia text dump makes a fine corpus for natural language processing projects.
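As a quick illustration, here's a minimal sketch of streaming articles out of a Wikipedia dump with the gensim library's WikiCorpus class. The filename is an assumption and depends on which dump you download:

from gensim.corpora.wikicorpus import WikiCorpus

# Path is an assumption -- substitute whichever
# enwiki-*-pages-articles.xml.bz2 dump you downloaded.
wiki = WikiCorpus('enwiki-latest-pages-articles.xml.bz2')

# get_texts() yields each article as a list of tokens, streamed
# from disk, so the full 14GB never has to fit in memory.
for i, tokens in enumerate(wiki.get_texts()):
    print(' '.join(tokens[:20]))  # first few words of the article
    if i >= 4:                    # just peek at five articles
        break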

The DBpedia project goes even further, providing a Java-powered framework for extracting more structured information from Wikipedia and related projects. It provides a pre-extracted knowledge base comprising 4.58 million things, most of which are classified using a carefully crafted and consistent ontology. This makes it easy to process relationships between things, or to restrict attention to a particular category of interest, be it people, places or plants.
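The knowledge base can also be queried live over SPARQL. Here's a minimal sketch using the SPARQLWrapper Python module against the public DBpedia endpoint; the particular query (a handful of plants) is just an illustration:

from SPARQLWrapper import SPARQLWrapper, JSON

# Public DBpedia SPARQL endpoint
sparql = SPARQLWrapper('https://dbpedia.org/sparql')

# Illustrative query: five things classified as plants in the ontology
sparql.setQuery('''
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?plant ?label WHERE {
        ?plant a dbo:Plant ;
               rdfs:label ?label .
        FILTER (lang(?label) = "en")
    } LIMIT 5
''')
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results['results']['bindings']:
    print(row['label']['value'])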

IMDb provides a number of datasets (www.imdb.com/interfaces) for non-commercial use, including a large collection of images of film stars. Images of the 100,000 most popular celebrities, sourced from both IMDb and Wikipedia, were used by a team from ETH Zurich to create the largest public database of faces in the world. This was used to train their DEX (Deep EXpectation) system, which estimates age and attractiveness, and proved an internet hit. Learn more and download the data at https://data.vision.ee.ethz.ch/cvl/rrothe/imdb-wiki.
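The dataset ships with its metadata in a MATLAB .mat file, which SciPy can read. The sketch below is only a rough guide: the 'imdb' struct layout is an assumption based on the dataset's documentation, so check it against the file you actually download:

from datetime import datetime, timedelta
from scipy.io import loadmat

# imdb.mat accompanies the image archive; the field names below
# are an assumption -- inspect meta.keys() on your own copy.
meta = loadmat('imdb.mat')
imdb = meta['imdb'][0, 0]

dob = imdb['dob'][0]                  # dates of birth, MATLAB serial dates
photo_taken = imdb['photo_taken'][0]  # year each photo was taken

# MATLAB serial dates count days from year 0; an offset of 366 days
# maps them onto Python's proleptic calendar.
def matlab_to_year(serial):
    return (datetime.fromordinal(int(serial)) - timedelta(days=366)).year

ages = [photo_taken[i] - matlab_to_year(dob[i]) for i in range(5)]
print(ages)  # approximate ages for the first five faces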

The UK government provides lots of datasets at https://data.gov.uk. Here's one about flood data.
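Many of these datasets are exposed through live APIs as well as bulk downloads. As a sketch, the Environment Agency's open flood-monitoring API can be queried with nothing more than the requests module; the endpoint is publicly documented, but the response fields shown here may change:

import requests

# Environment Agency real-time flood warnings (open data, no API key)
url = 'https://environment.data.gov.uk/flood-monitoring/id/floods'
resp = requests.get(url, timeout=30)
resp.raise_for_status()

for item in resp.json().get('items', [])[:5]:
    # 'severity' and 'description' are fields in the documented schema
    print(item.get('severity'), '-', item.get('description'))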
