PC Pro

The Business Question

Valuable information is effectively becoming extinct as old formats die. To find out how to prevent it happening to you, Nik Rawlinson speaks to the experts


How can you ensure your files remain viewable forever?

The BBC made an expensive mistake in the 1980s. It spent £2.5 million (£7.1 million in today’s money) building one of the first computer encyclopaedias. The massively ambitious Domesday Project, created to commemorate the 900th anniversary of the Domesday Book, shipped on a pair of LaserDiscs, a standard that’s largely disappeared. It was programmed using BCPL, a 51-year-old language that’s no longer in common use, and used analogue video stills layered on top of the interface where it needed to show a photo. This was, after all, the pre-JPEG era.

Even the hardware on which it ran – the BBC Micro and a LaserDisc player – was bespoke, and cost £5,000. Inevitably, much of the data was lost as the discs degraded, formats moved on and the hardware came to the end of its useful life. Work is still ongoing to try to recover the contents, some of which have been posted online (bbc.co.uk/history/domesday).

It’s hard to imagine the same thing happening now. Today, we have ubiquitous formats, and everything lives in the cloud. Doesn’t it?

Backups aren’t archives

In 2015, Google’s “chief internet evangelist”, Vint Cerf, warned that we face a “forgotten generation or even a forgotten century” as formats fall out of favour and hardware degrades. “We digitise things because we think we will preserve them, but what we don’t understand is that unless we take other steps, those digital versions may not be any better, and may even be worse, than the artefacts that we digitised.”

It’s a theme picked up by Arkivum’s Paula Keogh, who makes a clear distinction between archiving and backup – two allied fields that people who don’t work in digital preservation frequently confuse.

“A backup won’t be migrating the infrastructure or file format over time,” she said. “You’re locking your data in a metaphorical room, throwing away the key and hoping it will still be there in the future.”

Arkivum’s clients sign 25-year contracts for the preservation of their data which, in Keogh’s words, “is a lifetime in IT, but a drop in the ocean for an archive”.

Critically, they need their data to be not only secured, but also accessible. “Life science organisations [and others] want to be able to double-click a file in a couple of decades and open it... so media is one lifecycle management process that we undertake. The other is file format preservation. It’s not backup, scanning or digitisation, all of which can – and does – get confused with the term digital preservation. It’s about migrating the file formats into the most preservable version at that point.”

Format deprecation

It seems almost inconceivable that industry standards such as Word and Excel might disappear, but this is precisely what the data archiving standards body, the Association for Information and Image Management (AIIM), is planning for.

“The industry has decided that [archival-focused] PDF/A is going to be a future-proof format,” said Howard Frear of Easy Software, which sits on the body’s board. “It contains all of the data and metadata within the document itself, so you don’t necessarily need an application to open it, as there will always be an industry standard viewer.”

This will be more important to certain industries than others. Easy Software works with pensions providers, for example, who maintain their records for the life of each subscriber, plus 20 years, and need to know that the records they produce will still be accessible, potentially, 100 years from now.

That’s not guaranteed with proprietary formats. “With Microsoft Word, older and newer versions, they aren’t that compatible,” Frear said. “Backwards compatibility has been problematic but looking at forwards compatibility is nigh-on impossible unless you have a standard.”

However, if PDF/A is the way ahead, when should the file actually be generated? At the point when we save our assets, or when they’re added to an archive?

“It should be a problem for Apple, Microsoft, IBM and Amazon, but it’s not,” explained Keogh. “For us to be looking after our data well, when we’re creating the data in whatever format, that’s when you should have the option to make it as future-proof as possible.”

“To some degree, it’s down to the user to put in some extra effort,” Frear said, explaining that Microsoft Word can output PDF/A using an add-in. “Perhaps developers could do a little bit more and store both copies as part of the single save function, but then everybody is battling against the volume of data that creates.”

Keeping data alive

Now that we’re so used to putting our assets in the cloud, it’s easy to forget that the cloud, like your local hard drive, is still a limited resource backed by fallible hardware. That’s why taking responsibility for your own archive is essential.

“Cloud providers perhaps aren’t as mindful as the software community is,” Frear said. “Software and records management communities are driving the standards and we need to remind cloud vendors that it’s all very well bringing in new hardware, but that they have a responsibility to ensure that the data we put up to the cloud lives beyond the hardware’s usable life, and that as they move on to different hardware they have a responsibility to move the data across smoothly,” he continued.

If that archive remains usable, so much the better. PDF/A looks like the best compromise, preserving both the final look of the archived document, and extractable content for reuse.

“Could you read a WordPerfect file?” Keogh asked. “I couldn’t, not without an emulator, and that’s only from the 1990s, which from a data protection point of view, for something like the deeds of a house, someone’s pension scheme, a clinical trial or the research that meant you could bring a drug to market, is no time at all.”

Yet, despite warnings like this, a study published in the journal Current Biology found that only a fifth of all the research published in the early 1990s remains accessible.

The Digital Preservation Coalition, founded by the British Library and JISC (Joint Information Systems Committee), published a list of the world’s endangered digital species at the end of 2017 (dpconline.org/our-work/bit-list). It classified data from marginalised sub-groups and the photo archives of SMEs as critically endangered, requiring urgent action and assessment within 12 months. Even documents stored on Google Drive and Dropbox, where access is restricted to specific users, were listed as endangered, along with digital images with no analogue equivalent posted to social networks.

Archives and the right to be forgotten

The implementation of GDPR this May will have implications for archive-keeping, which Frear described as “another piece of the puzzle”. Keogh sees potential conflicts – particularly over the question of what should and shouldn’t be removed on request.

“There’s a lot still to be ironed out,” she said. “When you talk about things like [archived] genome sequencing or thumbprints you need to start asking what is identifiable about an individual. Is it their NI number, their first and last name, their DNA sequence? You can’t take an individual out of [a study] because it skews the figures. Yet, they still have the right to be forgotten, so how do those two conflicting things work in reality?”

The answer is likely to become clear through trial cases and legal guidance in the months after GDPR comes into force. It illustrates once again, though, the crucial difference between a static backup that rots with age, and a live, accessible archive, which remains an asset for the organisation that created it years or even decades into the future.


ABOVE Much of the BBC’s Domesday Project data was lost as the 1980s discs degraded, but some has been recovered and posted online
