How We Built Our Internet Archive Integration

Permanent Admin

November 7, 2019

At Permanent, we make it possible for anyone to preserve their personal digital legacy and make it perpetually accessible to future generations.

That’s why we take a multi-provider approach and work with commercial storage solutions as well as museums, libraries, and other digital archives. Storing multiple copies in many different places is the only way to guarantee preservation.

So when we learned that the good folks at the Internet Archive provided a public API that we could use to publish our users’ digital materials to their system, we knew we had to build an integration for it. The result was a surprisingly easy-to-use feature. Here’s how we did it.

The Internet Archive’s File Architecture

In the Internet Archive, once something goes in, it’s not meant to come out. The Internet Archive is designed to preserve a public, permanent, online copy of whatever content it is given to store: a snapshot of that content’s state at a given moment. Documents, media, and files are intended to be preserved in an unchanging state once they are uploaded into the system, whether that’s their only form, final form, or somewhere in between.

While this indelible state might not seem like a good feature for a typical cloud storage provider, Permanent is not your typical cloud storage platform. Preservation is the Permanent mission, and we deeply value the fixed, public nature of the Internet Archive in serving that mission. Working with files in an essentially “read-only” state is complementary to our own internal publishing feature.

However, the way the Internet Archive’s content is organized is not compatible with our hierarchical file storage system. The Internet Archive’s file hierarchy looks like this:

  • “An item”, designated by a unique identifier, can contain any number of files
  • “A collection” can contain any number of items.

Internet Archive collections, in their current form, don’t lend themselves to arbitrary groupings the same way a user might create folders in their own file system. They are used to thoughtfully curate categories of items across the system. Importantly, they are also locked behind administrative accounts to limit editing or creation. That makes sense given the Internet Archive’s focus on archival scenarios, but it makes approximating an arbitrary folder structure in their system a bit tricky.

The Internet Archive’s metadata schema is nicely accommodating: it’s based on Dublin Core, and the metadata values of individual items are largely free-form. Although there’s a basic set of universal metadata vocabulary across their system (for example, the name of the item, who uploaded the item, and what collection the item belongs to), it is possible to store completely arbitrary key-value pairs on an item through their API, even if that data is never surfaced in their UI. Some metadata fields can even be updated after an item is created, but fields related to file organization, like the unique identifier or the collection to which an item belongs, cannot.
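To make that metadata model concrete, here is a minimal sketch in Python of an item’s metadata as plain key-value pairs. The key names `title`, `uploader`, and `collection` stand in for the universal vocabulary mentioned above, and `permanent_folder_path` is a made-up custom key; none of this is meant as a definitive schema.

```python
# Illustrative only: key names and values are assumptions, not a documented schema.

# Organizational fields that, per the description above, cannot change after creation.
LOCKED_FIELDS = {"identifier", "collection"}


def build_item_metadata(title, uploader, collection, **extra):
    """Combine the common fields with arbitrary extra key-value pairs."""
    metadata = {"title": title, "uploader": uploader, "collection": collection}
    metadata.update(extra)  # arbitrary keys can be stored even if no UI shows them
    return metadata


def not_locked(metadata):
    """Return the fields that are not locked for organizational reasons."""
    return {k: v for k, v in metadata.items() if k not in LOCKED_FIELDS}


item = build_item_metadata(
    title="Family Photos 1985",
    uploader="someone@example.com",
    collection="example_collection",
    permanent_folder_path="Photos/1985",  # hypothetical custom key
)
print(not_locked(item))
```

Not every field returned by `not_locked` is necessarily editable in practice; the sketch only separates out the organizational fields that are fixed once an item exists.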

Permanent.org’s File Architecture

Permanent uses the more typical hierarchical file and folder structure familiar from most file systems, with users able to arbitrarily create and nest folders, and move files between any folders they choose. Users are also able to freely edit, update, and delete files whenever they’d like while those files are private. We only restrict files to read-only once they are published, but even then allow deletion. Even though archiving content is a core focus of Permanent, our private-first, public-second model accommodates typical consumers who are familiar with this file system paradigm, which allows them to organize files whichever way they’d prefer.

In Permanent, metadata exists as a specific set of fixed fields across all items in our system. Name and description, date information, associated location information if any – these are fields present on any item. The default set of fields is all that’s available. There’s no extensibility system that allows for arbitrary metadata fields.

Mapping Between Architectures

Given these differences between our two systems, how does one go about integrating one into the other? The design constraints are simplified by the Permanent use case. The goal of the integration is to enable users to publish materials to the Internet Archive, so we only have to consider a push case, not a pull or sync. Pushing files from Permanent to the Internet Archive simply requires mapping a hierarchical system to a non-hierarchical one.

One option is to limit the integration to only permit individual files to be published to the Internet Archive, and to publish them to an all-inclusive Permanent collection. Single files, by the nature of being single files, don’t require any hierarchy, so there’s nothing to map or fold down – one file on Permanent is one file in the Internet Archive as well. While this works for the purposes of Permanent and wouldn’t impact the accessibility of files or their metadata on the Internet Archive, this wouldn’t be the best solution for users who meticulously organize their files.

Another approach would be to leverage collections inside the Internet Archive by generating an arbitrary number of collections to serve as an analog for folders. This solution would have been a nice 1:1 mapping of any file and folder to items and collections. However, Internet Archive collections are intentionally limited by system administrators and thoughtfully created as the need arises to collect a very large number of items. Other options would be to create collections per user, or even to leverage file metadata or object recognition to assign materials to collections based on machine learning or intelligent algorithms.

After exploring those options, we decided that for our prototype integration, the permanent, public-first, read-only nature of the Internet Archive makes a one-way flattening of our folder structure an acceptable solution. There would be no need to reverse the process and rebuild the folder structure in the future. The folder that a Permanent.org user publishes serves as the main root item in the Internet Archive. Any files inside that root folder are uploaded as files on that item. Nested subfolders of the root are flattened, and the contents of these subfolders all show up in the item corresponding to the published root folder.
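The flattening described above can be sketched in a few lines of Python. The `File` and `Folder` classes here are hypothetical stand-ins rather than Permanent’s actual data model; the point is only to show that every file under the published root ends up in one flat list, with the nesting discarded.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Union


@dataclass
class File:
    name: str


@dataclass
class Folder:
    name: str
    children: List[Union["Folder", File]] = field(default_factory=list)


def flatten(folder: Folder) -> Iterator[File]:
    """Yield every file under `folder`, discarding the folder nesting."""
    for child in folder.children:
        if isinstance(child, Folder):
            yield from flatten(child)  # subfolder contents join the same flat list
        else:
            yield child


root = Folder("Vacation", [
    File("itinerary.pdf"),
    Folder("Day 1", [File("beach.jpg"), Folder("Evening", [File("dinner.jpg")])]),
])
flat_names = [f.name for f in flatten(root)]
print(flat_names)  # ['itinerary.pdf', 'beach.jpg', 'dinner.jpg']
```

Note that the subfolder names (“Day 1”, “Evening”) are not recorded anywhere in the result, which is what makes this a one-way mapping.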

While this approach is not a good fit for nested folder structures, some level of compromise is inevitable, and this feels like the right one for a prototype integration. We will monitor utilization of this feature to determine if an adjustment needs to be made, and a future version is likely to leverage collections in some fashion.

Solution Implementation

There are a few ways to interface with the Internet Archive’s data store. Regular HTTP interaction with their relatively simple public API is one option, and depending on your tech stack, someone may have already written a wrapper that would make integration straightforward. They’ve also created an S3-compatible layer on top of their API that allows developers to execute some of the basic write operations available in any Amazon S3 SDK. In that case, many developers might be able to write files to the Internet Archive without changing much of their existing system code at all.
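As a rough illustration of the plain-HTTP route through the S3-compatible layer, the Python sketch below builds (but does not send) an upload request. The endpoint `s3.us.archive.org`, the `LOW access:secret` authorization scheme, and the `x-amz-auto-make-bucket` and `x-archive-meta-*` headers reflect our understanding of the Internet Archive’s S3-like API; treat them as assumptions to verify against the current documentation. The identifier, filename, and keys are placeholders.

```python
import urllib.parse
import urllib.request


def build_upload_request(identifier, filename, data, access_key, secret_key, title):
    """Build (but do not send) a PUT request that uploads one file to an item."""
    url = f"https://s3.us.archive.org/{identifier}/{urllib.parse.quote(filename)}"
    headers = {
        "Authorization": f"LOW {access_key}:{secret_key}",
        "x-amz-auto-make-bucket": "1",  # assumed: create the item if it is missing
        "x-archive-meta-title": title,  # assumed: item metadata travels as headers
    }
    return urllib.request.Request(url, data=data, headers=headers, method="PUT")


request = build_upload_request(
    identifier="example-item-id",  # placeholder identifier
    filename="beach.jpg",
    data=b"...file bytes...",
    access_key="ACCESS",  # placeholder credentials
    secret_key="SECRET",
    title="Vacation",
)
print(request.get_method(), request.full_url)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen`; the sketch stops short of that so it can be read and run without credentials.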

Since the Permanent.org integration was intended to be a prototype, we chose to explore a third option: the official Internet Archive CLI utility. With uploading as the only function we needed from the API, using the CLI utility as part of our backend queue service made it incredibly simple to get files into their system. The CLI handles authentication, error reporting, and retries for us. The Permanent.org file processing pipeline downloads files to disk before working on them. Therefore, pointing the CLI utility at those local files was a straightforward solution for us.
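A backend task might shell out to the CLI along the lines of the Python sketch below. It assumes the `ia` executable is installed and already configured with credentials (for example via `ia configure`), and it uses the `ia upload <identifier> <file>... --metadata=key:value` form. The identifier and file path are placeholders, and this is not our production code.

```python
import subprocess


def build_ia_upload_command(identifier, file_paths, metadata=None):
    """Return the argv list for `ia upload <identifier> <file>... --metadata=key:value`."""
    command = ["ia", "upload", identifier, *file_paths]
    for key, value in (metadata or {}).items():
        command.append(f"--metadata={key}:{value}")
    return command


def upload_with_cli(identifier, file_paths, metadata=None):
    """Run the CLI; raises CalledProcessError if it exits with a non-zero status."""
    subprocess.run(build_ia_upload_command(identifier, file_paths, metadata), check=True)


# Only the command is built here; calling upload_with_cli would perform the upload.
command = build_ia_upload_command(
    "example-item-id", ["/tmp/permanent/beach.jpg"], {"title": "Vacation"}
)
print(" ".join(command))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with filenames that contain spaces.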

After choosing to use the CLI, we needed to wire up the process that’s kicked off when a user decides to publish items to the Archive. Due to the potential size of a user’s selected materials, we created a new background task in our existing pipeline to handle this publishing process. Once a user selects an item on Permanent to publish to the Internet Archive, a new unique ID is generated from the Permanent ID of the item to use as the Archive ID, and they get a link to the Archive where this content will soon be available.
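One way to derive an Archive ID deterministically from a Permanent ID is sketched below in Python. The `permanent-` prefix, the sample ID format, and the set of characters treated as safe are all assumptions for illustration; the actual derivation we use is not shown here. The `https://archive.org/details/<identifier>` form is the usual public URL for an item.

```python
import re

ARCHIVE_ID_PREFIX = "permanent-"  # hypothetical namespace prefix


def archive_id_for(permanent_id: str) -> str:
    """Derive a deterministic Internet Archive identifier from a Permanent ID."""
    # Keep only characters assumed safe in an identifier; replace runs of others with "-".
    safe = re.sub(r"[^A-Za-z0-9._-]+", "-", permanent_id).strip("-")
    if not safe:
        raise ValueError("Permanent ID has no usable characters")
    return ARCHIVE_ID_PREFIX + safe


def archive_url_for(permanent_id: str) -> str:
    """Link that can be shown to the user before the upload has finished."""
    return f"https://archive.org/details/{archive_id_for(permanent_id)}"


print(archive_url_for("0001-00ab"))  # https://archive.org/details/permanent-0001-00ab
```

Because the same Permanent ID always yields the same Archive ID, the link can be handed to the user immediately, before the background task has run.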

The background publish task is queued up, tagged with the Permanent ID of the item to publish and a new Internet Archive ID that will be used as the destination for the file upload. If it’s a single file, that file is pushed out to the Archive, and that’s the end of it. If it’s a folder, all top-level items in that folder get their own background task queued up, tagged with the same Internet Archive ID as a destination. This process recurses down through all subfolders, finding the files they contain and uploading them to the same Internet Archive item, as tagged by that shared ID.
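The fan-out of tasks can be modeled with a simple in-memory queue, as in the Python sketch below. The `FOLDERS` mapping and the `upload` callback are hypothetical stand-ins for our database and upload step, and a real queue service would run these tasks asynchronously rather than in a single loop.

```python
from collections import deque

# Hypothetical stand-in for Permanent's records: folder IDs map to their child IDs.
FOLDERS = {"folder-1": ["file-a", "folder-2"], "folder-2": ["file-b", "file-c"]}


def run_publish_tasks(permanent_id, archive_id, upload):
    """Process publish tasks until the queue is empty; return the uploaded file IDs."""
    queue = deque([(permanent_id, archive_id)])
    uploaded = []
    while queue:
        item_id, destination = queue.popleft()
        if item_id in FOLDERS:
            # A folder fans out into one task per child, all sharing the destination.
            queue.extend((child, destination) for child in FOLDERS[item_id])
        else:
            upload(item_id, destination)
            uploaded.append(item_id)
    return uploaded


calls = []
result = run_publish_tasks(
    "folder-1", "permanent-folder-1", lambda item, dest: calls.append((item, dest))
)
print(result)  # ['file-a', 'file-b', 'file-c']
```

Every task carries the same destination ID, which is what causes files from nested subfolders to land on a single Internet Archive item.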

When this process is complete, all the contents of the selected folder are available under that single item, flattened to a single level. Currently, because we upload only the original file to the Internet Archive, only the original filename and any original, embedded metadata are preserved. The location where any given item ends up in the Internet Archive is tracked by a table in our database, which prevents duplicate uploads of files. We also limited the integration to user files on Permanent that were already public and read-only, so that the unique IDs used on the Internet Archive are simply generated from the unique IDs used on Permanent and there is a true 1:1 mapping of IDs.
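A tracking table of this kind might look like the Python sketch below, which uses an in-memory SQLite database. The table and column names are invented for illustration and do not reflect our actual schema; the idea is simply that a uniqueness constraint on the Permanent ID turns a second upload attempt into a detectable no-op.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE archive_upload (
        permanent_id TEXT PRIMARY KEY,  -- one row per Permanent item
        archive_id   TEXT NOT NULL      -- the Internet Archive item it was sent to
    )
    """
)


def claim_upload(conn, permanent_id, archive_id):
    """Record the upload; return False if this item was already recorded."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO archive_upload (permanent_id, archive_id) VALUES (?, ?)",
                (permanent_id, archive_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


print(claim_upload(conn, "file-a", "permanent-folder-1"))  # True
print(claim_upload(conn, "file-a", "permanent-folder-1"))  # False
```

A publish task would call `claim_upload` before uploading and skip the file when it returns `False`.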

A different system might choose to structure an integration to function as a snapshot tool instead, and may require individual, unique copies published to the Internet Archive, so that the state of a set of files or folders at a given time is always represented and never overwritten or augmented. In that case, a file ID on that system might correspond to many IDs on the Internet Archive, one for each snapshot.

Tell us what you think!

We’re eager to hear what you think of our approach. Leave us a comment below. Let us know if you’ve tried a similar integration with the Internet Archive or if you see another way we could tackle this problem. Give Permanent a try and see how our private-first, public-second storage approach and the Internet Archive integration work for you, then send us some feedback.

