GitHub has shared the latest update on its Archive Program: the journey of the world’s open source code to the Arctic Circle in July.
The GitHub Archive Program was introduced along with the GitHub Arctic Code Vault at GitHub Universe 2019.
The stated mission was to preserve open source software for future generations by storing code in an archive built to last a thousand years.
In February this year, GitHub says, it took a snapshot of all active public repositories on GitHub to archive in the vault. Over the past several months, its archive partner Piql wrote 21TB of repository data to 186 reels of piqlFilm (digital photosensitive archival film).
The original plan was for the team to fly to Norway and personally escort the world’s open source code to the Arctic, but as the world continues to endure a global pandemic, plans were changed.
The journey began at Piql’s facility in Drammen, Norway, from which the boxes containing the 186 film reels were shipped to Oslo Airport and then loaded into the belly of a plane that provides passenger service to Svalbard. Svalbard, roughly 600 miles (1000 km) north of the European mainland, had only recently opened up to visitors from countries within the Schengen Area and the European Economic Area.
The code landed in Longyearbyen, a town of a few thousand people on Svalbard, where the boxes were met by a local logistics company and taken into intermediate secure storage overnight. The next morning, the reels traveled to the decommissioned coal mine set in the mountain, and then to a chamber deep inside hundreds of meters of permafrost, where the code now resides, fulfilling the program’s mission of preserving the world’s open source code for over 1,000 years.
GitHub says it is now happy to report that the code was successfully deposited in the Arctic Code Vault on July 8, 2020.
Millions of developers around the world contributed to the open source software now stored in the Arctic Code Vault.
The Internet Archive is a well-known, widely beloved non-profit digital library which provides free public access to collections of digitized materials. In partnership with the GitHub Archive Program, the Internet Archive (IA) commenced its ongoing archive of GitHub public repositories on April 13 of this year. At present, IA is using a two-pronged approach.
First, their well-known Wayback Machine is accessing and archiving raw GitHub data as WARCs, or Web ARChive files. As of July they have archived some 55TB of data. Second, they have the goal of making entire archived GitHub repositories available via “git clone,” while also keeping repo comments, issues, and other metadata easily accessible on the web. This second initiative is well underway and initial archiving is expected to commence this month.
Software Heritage Foundation
Software Heritage is a non-profit, multi-stakeholder initiative launched by Inria in collaboration with UNESCO with the goal of collecting, preserving and sharing the source code of our software commons. It already archives more than 130 million projects, with their full development history, and 100 million of these are from GitHub.
GitHub says the archival engine is being improved to keep pace with its growth, but if the project you are interested in, or its latest version, is not archived yet, you do not need to wait: you can request that it be archived.
Project Silica is developing the first storage technology designed and built from the media up for cloud-scale storage of long-lived data. By leveraging recent discoveries in ultrafast laser optics, data is stored in quartz glass through a process that permanently changes the physical structure of the glass material. Quartz glass is a durable storage medium that offers data lifetimes of tens of thousands of years.
It is resilient to electromagnetic interference, water, and heat, making it the ideal storage medium for ensuring the world’s open source software is preserved for future generations. As a partner in the GitHub Archive Program, Project Silica says it is committed to driving storage innovation and to developing a technology that addresses the need for sustainable and reliable storage of the world’s long-lived data.
GitHub says it has archived 6,000 of the world’s most popular repositories as a proof of concept for future archives.
The Tech Tree: Code, Culture and History
Every reel of the archive includes a copy of the “Guide to the GitHub Code Vault” in five languages, written with input from GitHub’s community and available at the Archive Program’s own GitHub repository. In addition, the archive will include a separate human-readable reel which documents the technical history and cultural context of the archive’s contents. GitHub calls this the Tech Tree.
Inspired by the Long Now’s Manual for Civilization, the Tech Tree will consist primarily of existing works, selected to provide a detailed understanding of modern computing, open source and its applications, modern software development, popular programming languages, etc. It will also include works which explain the many layers of technical foundations that make software possible: microprocessors, networking, electronics, semiconductors, and even pre-industrial technologies. This will allow the archive’s inheritors to better understand today’s world and its technologies, and may even help them recreate computers to use the archived software.
GitHub says encapsulating the world’s cultural context and technical history is a challenging prospect, and it expects the Tech Tree to evolve over time. GitHub says it will soon publish an initial draft list of works selected for the Tech Tree to the Archive Program’s GitHub repository, along with a request for community input.
(Ed. Featured image provided courtesy of GitHub.)