FAQ | ARCHE

Archiving

How can I save my valuable research material for the future?

You should definitely go to one of the established repositories for research data. They are made just for that. In Austria, there are already several institutional repositories available. Which one to use depends on the type of resource you have, how it is to be used and also on your affiliation. You can search for a suitable repository on re3data.org.

Additionally, you should make sure to use file formats suitable for long-term preservation and provide sufficient documentation for your data (metadata) to enable others to understand your resources.

Finally, you should not only consider deposition of your data in a reliable repository but also open access for your data. Open Access is essential for reuse and thus longevity of data. Only visible and accessible data can be reused and thus be made more valuable. Many international and Austrian institutions have declared their support for Open Access. The Open Definition provides further details and lists conformant licences. Forschungslizenzen.de gives a comprehensive overview of open and restrictive licences and provides guidance for choosing a proper licence. If you want to use one of the widespread Creative Commons (CC) licences you can use their tool to choose a licence.

Have a look at the FAIR Data Principles to learn about recommended measures for discovery and reuse of data.

What are the FAIR Data Principles?

FAIR stands for Findable, Accessible, Interoperable and Reusable data and metadata. The principles, which were formulated by leading stakeholders in the field (representing academia, industry, funding agencies, and scholarly publishers), recommend and describe measures to foster discovery and reuse of data. The FAIR Data Principles have since also become part of official European recommendations (https://www.force11.org/group/fairgroup/fairprinciples).

Do you accept data from anybody?

As part of the CLARIAH-AT infrastructure, ARCHE is primarily intended to be a digital data hosting service for the humanities in Austria. Thus data from all humanities fields including modern languages, classical languages, linguistics, literature, history, jurisprudence, philosophy, archaeology, comparative religion, ethics, criticism and theory of the arts are equally welcome.

Detailed information is provided in the Collection Policy.

When in doubt, get in touch!

Do you accept any kind of data?

ARCHE aims for a broad scope of humanities research data. We accept a wide range of resources, including digital texts, lexicographic resources, semantic resources, tabular data, databases, digital images, file collections like GIS, 3D or CAD, media files, and many more.

Detailed information is provided in the Collection Policy.

In case of doubt, simply contact us at acdh-helpdesk@oeaw.ac.at.

What data formats do you accept?

See our list of accepted and preferred formats for archiving.

What does UTF-8 without BOM mean?

UTF stands for Unicode Transformation Format and denotes a set of character encodings for the Unicode character set. UTF-8 encodes each character as a sequence of one to four bytes (a byte being eight bits). Other UTF encodings like UTF-16 or UTF-32 use code units of two or four bytes, which can be stored with the most significant byte in first (big-endian) or last place (little-endian). Thus a Byte Order Mark (BOM) is needed to signal the byte order; it is represented by the character U+FEFF placed at the very start of the data.

Since UTF-8 is byte-oriented, its byte order is fixed, so a BOM is not necessary and should be avoided. An advantage of using UTF-8 is that the 128 characters of ASCII are preserved and encoded in the same manner.

See the official FAQ about UTF-8, UTF-16, UTF-32 & BOM and the IT-Empfehlungen from IANUS for further information on this topic and encoding in general.
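If you are unsure whether a file starts with a BOM, a few lines of Python can check and remove it. The following is only a minimal sketch and not part of the ARCHE tooling; the function names and the sample text are made up for illustration.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the byte string starts with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the byte string without a leading UTF-8 BOM, if one is present."""
    if has_utf8_bom(data):
        return data[len(codecs.BOM_UTF8):]
    return data


# Illustrative sample: a text that was saved as "UTF-8 with BOM".
with_bom = codecs.BOM_UTF8 + "Österreich".encode("utf-8")
clean = strip_utf8_bom(with_bom)
```

To apply this to a file, read it in binary mode, pass the bytes to strip_utf8_bom and write the result back. Alternatively, Python's built-in "utf-8-sig" codec silently drops a leading BOM when decoding.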

What is the actual deposition procedure?

Deposition and archiving involve work by both the data provider and the ARCHE data curators. During the submission of digital resources to the repository, the data undergoes a curation process in order to ensure quality and consistency. We assist you in meeting the necessary requirements for sustainable resource archiving: data have to be provided with metadata and in preferred formats, persistent identifiers (PIDs) have to be assigned, IPR issues have to be resolved and clear statements with regard to licensing and possible use of the resources are to be made.

Deposition involves four stages, which are detailed in the Deposition Process:

Preparation steps on your side before the submission
The actual submission and handing in of the data
Checks on the data on our side after we have received it, which can result in the need for a review on your side and a resubmission
Actual archiving & publication of the data

How can I compile a list of files for my data?

In order to provide initial information about your resources for the ARCHE curators, a list of files is useful. You can create it by hand or automatically with various tools. All major operating systems already provide built-in commands for this, such as tree and dir on Windows, and ls on Linux and Mac. Alternatively, you can install dedicated tools like DROID.
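If you prefer a script over the operating system commands, a file list can also be generated with a few lines of Python. This is a minimal sketch, not an official ARCHE tool: the function names and the two CSV columns (path and size_bytes) are our own choice for illustration and not a format required for deposition.

```python
import csv
import os


def list_files(root: str) -> list[tuple[str, int]]:
    """Return (relative path, size in bytes) for every file below root, sorted by path."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full_path = os.path.join(dirpath, name)
            relative = os.path.relpath(full_path, root)
            # Use forward slashes so the list looks the same on every platform.
            entries.append((relative.replace(os.sep, "/"), os.path.getsize(full_path)))
    return sorted(entries)


def write_file_list(root: str, out) -> None:
    """Write the file list of root as CSV to the open text stream out."""
    writer = csv.writer(out)
    writer.writerow(["path", "size_bytes"])
    writer.writerows(list_files(root))
```

For example, calling write_file_list("my_collection", sys.stdout) after import sys prints the list for a (hypothetical) folder named my_collection; passing an open file instead of sys.stdout saves it for sending to the curators.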

Why do I have to select a licence?

Providing a licence for your data makes it reusable and clearly describes the rights you give potential reusers. Reusing data with a licence is easier than without.

What licence should I use?

You should consider open access for your data, which is essential for reuse and consequently longevity of data. Only visible and accessible data can be reused and thus made more valuable. Many institutions have already declared their support for Open Access, including numerous Austrian institutions. The Open Definition provides further details and lists conformant licences. Forschungslizenzen.de gives a comprehensive overview of open and restrictive licences and provides guidance for choosing a proper licence. If you want to use one of the widespread Creative Commons (CC) licences you can use their tool to choose a licence.

We suggest the use of CC-BY (CC - Attribution) or CC-BY-SA (CC - Attribution-ShareAlike). When depositing software consider using specific software licences like BSD or GPL. You can use the License Selector tool to select an appropriate licence for either software or data.

How can the archived data be cited?

ARCHE makes use of the Handle System to assign unique and persistent identifiers to the digital objects. In this way, every resource has a uniquely identifiable URL that will always point to the same data, wherever it might physically move in the future. The handle is especially meant for citing the resources in publications. With additional information about creators and contributors, ARCHE generates a suggested citation that is displayed alongside each resource.

What is a PID?

PID stands for persistent identifier and is a unique string that is persistently assigned to a digital object. It is comparable to the ISBNs assigned to print publications in order to identify them. A PID helps in identifying and referencing an object in a stable manner, irrespective of the actual storage location. Examples of PID systems are URNs, DOIs or Handles.
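To illustrate how a PID stays independent of the storage location: a Handle consists of a prefix and a suffix separated by a slash, and it can be turned into a resolvable link by prepending the address of the global Handle proxy, https://hdl.handle.net/. The sketch below shows this in Python. The function name is our own, and the handle 12345/abc-001 used in the usage note is an invented placeholder, not a real ARCHE identifier.

```python
from urllib.parse import quote

HANDLE_PROXY = "https://hdl.handle.net/"


def handle_to_url(handle: str) -> str:
    """Turn a Handle such as 'hdl:PREFIX/SUFFIX' into a URL on the Handle proxy."""
    handle = handle.strip()
    # Accept the bare handle, the 'hdl:' notation, or an already resolvable URL.
    for scheme in ("hdl:", "https://hdl.handle.net/", "http://hdl.handle.net/"):
        if handle.lower().startswith(scheme):
            handle = handle[len(scheme):]
            break
    prefix, separator, suffix = handle.partition("/")
    if not separator or not prefix or not suffix:
        raise ValueError("a Handle must have the form PREFIX/SUFFIX")
    # Percent-encode characters in the suffix that are not safe in a URL path.
    return HANDLE_PROXY + prefix + "/" + quote(suffix, safe="/-._~")
```

With the placeholder above, handle_to_url("hdl:12345/abc-001") yields https://hdl.handle.net/12345/abc-001. The link stays the same even if the object behind it moves, because only the target registered for the Handle is updated.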

What if I want/need to update the archived data?

Every change to the resources and metadata is stored as a new version. If the changes are substantial or the two versions should both be equally available, a new object with a new PID should be created that is equipped with a link to the preceding version, which retains its PID.

How safe is my data in ARCHE?

ARCHE runs on the systems maintained by the Computing Centre of the Austrian Academy of Sciences (ARZ), which makes for a solid organisational and technical backing.
To avoid data loss due to deterioration of physical storage, malicious threats or other emergencies, redundancy is key for the preservation of data. Regular backups help us to protect and restore data.

Backups of the data in the repository are performed regularly: a daily copy is stored and replicated within the internal ARZ NetApp setup on-site. In addition, the data is replicated off-site to the long-term storage in the computing centre run by the Max Planck Computing and Data Facility (MPCDF) in Garching, Germany. Checks of the integrity of the copies are performed regularly. We keep at least three copies at all times, one of them off-site.
Further details are described in our Storage Procedures.
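If you want to check the integrity of your own copies before or after a transfer, comparing checksums is a common approach. The following Python sketch illustrates the general idea only; it is not a description of the actual ARCHE or ARZ procedures, and the function names and the choice of SHA-256 are assumptions made for this example.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 checksum of a file as a hexadecimal string."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not have to fit into memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copies_match(original: str, *copies: str) -> bool:
    """Return True if every copy has the same checksum as the original file."""
    expected = sha256_of(original)
    return all(sha256_of(copy) == expected for copy in copies)
```

A copy whose checksum differs from the original has been altered or damaged and should be replaced from one of the intact copies.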

What if I want to withdraw the resources in the future? Can I delete the data?

Yes, if needed. However, we need to keep at least a reference showing that the data was there. Therefore administrative metadata will be retained, indicating that the data itself was removed. The assigned PID will be kept and will point to a tombstone page displaying the metadata.

Do I need to pay to deposit the resources?

Deposition and storage are free of charge. The repository is run as part of the research infrastructure as a service to the community. If the data requires further processing and extensive curation, we might charge for the curation effort.

I don't want / cannot make the data publicly available. Would you still archive them for me?

In accordance with the advocacy of the research infrastructures and the general development with respect to Open Access, we strongly encourage data producers to be as open as possible: publicly available data has a better chance of being picked up by fellow researchers, which is good for your reputation and citation count. Public funding agencies increasingly require researchers to publish not only the results of their research, but also the research data.

However, we are aware that the Open Access approach is not possible in all cases. IPR or ethical issues as well as strategic considerations may require more restrictive access modes. We will help you to select the right licence for your needs. If necessary, we also offer the possibility to just archive the data, without any public access.

Search, Resource Availability

How can the archived data be found?

The resources are published on ARCHE’s web site and can be browsed through the web interface.

Furthermore, metadata about the resources are offered for harvesting via OAI-PMH, allowing dissemination via additional channels, such as the Virtual Language Observatory, CLARIN’s central metadata catalogue.
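For those who want to harvest the metadata themselves, the sketch below shows in Python what a basic OAI-PMH ListRecords request and the handling of its response might look like. The verb ListRecords and the metadata prefix oai_dc are part of the OAI-PMH standard; the base URL in the usage note, the function names and the embedded sample response are invented for illustration and do not describe the actual ARCHE endpoint.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces defined by OAI-PMH 2.0 and Dublin Core.
NAMESPACES = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build the URL of a ListRecords request for the given OAI-PMH endpoint."""
    return base_url + "?" + urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})


def record_titles(response_xml: str) -> list[tuple[str, str]]:
    """Extract (identifier, title) pairs from a ListRecords response."""
    root = ET.fromstring(response_xml)
    results = []
    for record in root.iterfind(".//oai:record", NAMESPACES):
        identifier = record.findtext("oai:header/oai:identifier", default="", namespaces=NAMESPACES)
        title = record.findtext(".//dc:title", default="", namespaces=NAMESPACES)
        results.append((identifier, title))
    return results


# Invented, heavily shortened response used only to demonstrate the parsing.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example collection</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""
```

Calling list_records_url("https://repository.example.org/oai") builds the request URL for a placeholder endpoint, and record_titles(SAMPLE_RESPONSE) returns the identifier and title of the single sample record. A real harvester would additionally have to follow the resumptionToken that OAI-PMH uses to split long result lists.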

Can I do anything with the resources? What are the regulations regarding access?

In general, the Terms of Use apply to the use of the resources and services provided by ARCHE. Additionally, resource-specific licences apply as stated in the description of every resource.

Do I need to pay to get to the resources?

No. All the resources are available free of charge.

Do I need to register/log in to get to the resources?

It depends. There are three basic modes of access: public, academic and restricted.

Public resources are accessible without any further restrictions. Academic access means that you have to be affiliated with an academic institution (e.g. be a member of a university). This is checked primarily via the so-called Federated (or Shibboleth) Login. If you cannot log in via Shibboleth, but are nevertheless an academic or have academic motives for accessing the resource, please contact us.

Some of the resources are only available on the basis of a special agreement. This is indicated by the "restricted" access mode which usually implies that you have to fill in a registration form and accept a special licence. In the worst case the resource is not available online at all. In this case, you need to contact us to find out if and how to get access to the resource.

What is this Federated (or Shibboleth) Login?

Shibboleth, AAI (Authentication and Authorisation Infrastructure), or SSO (Single Sign-On) refer to an architecture where service providers rely on identity providers to authenticate users. That is, if users want to use a certain service of a provider (like ARCHE) for which they need to authenticate, they are redirected to their home institution (e.g. university), where they can log in with their institutional credentials. If successful, the home institution lets the provider know that they are entitled to use the service. In short, you can log in to different services with your institutional account without the need to register separately every time.

This is similar to the OpenID initiative known in the "commercial" world (logging in to a web page with your Google or Facebook account).

Given that this "Identity Federation" is established by academic institutions, it is implicitly assumed that a user who can log in via Shibboleth is an academic.

What is OAIS?

The Open Archival Information System (OAIS) is a reference model developed by the Consultative Committee for Space Data Systems (CCSDS) and consists of a set of recommendations for archival systems dedicated to long-term preservation and maintenance of digital information.

Within OAIS a functional model consisting of six functional entities is described. Between these entities information packages are exchanged, containing either the originally submitted information (Submission Information Package, SIP), the information prepared for archiving (Archival Information Package, AIP) or the information ready for dissemination (Dissemination Information Package, DIP).

More information can be found in publications by CCSDS, as for example in the Magenta Book.

What does SIP mean?

SIP stands for Submission Information Package and represents the information package that is delivered to ARCHE for ingest and archiving. The SIP contains the data to be stored and all necessary metadata about the package and its content.

When submitting a SIP, please make sure that the data is provided in formats suitable for long-term preservation and that sufficient metadata accompanies the package.

What does AIP mean?

AIP stands for Archival Information Package. It contains the metadata and the data submitted via the SIP, information about preservation and other documentation accumulated during the ingestion process. Data from the SIP might have to undergo file conversions to produce an AIP with data suitable for long term preservation.

What does DIP mean?

DIP stands for Dissemination Information Package. A DIP can be derived from one or multiple AIPs and is used to present the data and metadata to the consumer. The content of a DIP is presented in delivery formats which might be different from archival formats used in the AIP. Delivery formats are tailored to the bandwidth available and user requirements. A single file might be available in a variety of delivery formats.

Why did you switch the software stack from Fedora Commons to a custom-tailored solution?

When we were planning the implementation of ARCHE in 2017, after thorough evaluation of multiple existing solutions, Fedora Commons seemed the most suitable candidate, being widely used to run repositories around the world, including multiple CLARIN Centres. However, version 3 of Fedora, the most widely employed version at the time (and still as of 2020), had been announced to reach end of life and would not be developed any further. Thus it seemed natural to adopt the new version. Version 4 of Fedora Commons went through a complete redesign and re-implementation which abandoned many proven concepts and introduced technological decisions which in hindsight turned out to be very problematic. Some of these decisions were revoked in an intermediate version 5, and currently (2020) work is being done on version 6, for which a stable release is expected at the beginning of 2021.

Meanwhile the problems in our solution grew. Though we were able to work around these, it came at the cost of spending time on developing workarounds. Additionally, Fedora’s performance quickly deteriorated with the growing amount of data in the repository, making ingests of any bigger dataset next to impossible.

During those three years we gathered a lot of experience with the data we curated and ingested, enabling us to distil which features are crucial for us in a repository solution.

When we came to the difficult decision to abandon our Fedora 4 based solution, we once again revisited other existing solutions, surveying whether any of them delivers the features we expect. We came to the conclusion that none of the solutions serves our use case completely. Although all provide means for customisation and extensibility, they come with complex components, which have to be considered black boxes, even though they are open-source. This would presumably lead to experiences similar to those we had with Fedora 4, which we wanted to avoid.

Thus, against all usual advice and good practice, we decided to develop a custom-tailored solution from scratch, serving our specific needs. We strived to make it as generic as possible, to be applicable in a multitude of scenarios and use cases.

The system is based on a very conservative technology stack: plain, strictly object-oriented PHP with a PostgreSQL database to store the metadata. The overall architecture is cleanly divided into multiple components, each with a clear function and a well-defined API exposing its functionality.

We preserved all functionality, so that both the user interface and the APIs behave exactly the same as before, just an order of magnitude faster and with an order of magnitude lower resource consumption.