eArchiving in Action
Workshop on 25, 27, 28 January 2021
European Commission, DG Connect
Interactive Technologies, Digital for Culture and Education Unit
Rehana Schwinninger-Ladak, Head of Unit
Adelina Dinu
Fulgencio Sanmartín
eArchiving in Action: Data Producers
eArchiving Workshop – 25th January 2021
DIGIT: Directorate-General for Informatics
DG Connect: Directorate-General for Communications Networks, Content and Technology
E-ARK Consortium
Agenda
• Demonstration
• Use Case
• Questions & Answers
• Panel Discussion
• Final Questions & Answers
Database preservation toolkit
Luís Faria
KEEP SOLUTIONS
Databases
The information that supports institutions and businesses is usually centralised
in databases.
This information is of great value and needs to be preserved for decades for
strategic and legal reasons.
The systems that hold this information are usually complex, with many software
components supporting the business logic and the submission and presentation
interfaces.
The information is usually laid out in a structure optimised for the database
and the original business objectives (i.e. not in a user-friendly structure).
The problem with preserving databases
• Every vendor has its own data types and export formats
• Information exported from one vendor’s system rarely works on another
• Sometimes an export does not even work across different versions of the same product
• Need for a vendor-agnostic format based on standards
Preservation format criteria
• Ubiquity
• Support
• Disclosure
• Documentation quality
• Stability
• Ease of identification and validation
• Intellectual Property Rights
• Metadata support
• Complexity
• Interoperability
• Viability
• Re-usability
https://www.nationalarchives.gov.uk/documents/selecting-file-formats.pdf
SIARD: Software Independent Archiving of Relational Databases
• Database preservation format
• Based on international standards
• For database data, structure and behaviour
• Swiss national standard eCH-0165
• Now managed by DILCIS board and the EU eArchiving building block
https://dilcis.eu/content-types/siard
https://ec.europa.eu/cefdigital/wiki/display/CEFDIGITAL/eArchiving
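A SIARD file packages the archived database in a single ZIP-based container, with metadata and table content stored as XML. The sketch below, in Python, lists the table content files in such a container. The `header/metadata.xml` and `content/<schema>/<table>/` paths are an assumption about the SIARD 2 layout that should be checked against the specification, and `list_table_files` and `build_demo_container` are hypothetical helper names.

```python
import io
import zipfile


def list_table_files(siard_file):
    """Return (has_metadata, table_xml_paths) for a SIARD-like container.

    Assumed layout: header/metadata.xml plus one XML file per table
    under content/<schema>/<table>/.
    """
    with zipfile.ZipFile(siard_file) as archive:
        names = archive.namelist()
    has_metadata = "header/metadata.xml" in names
    tables = sorted(
        name for name in names
        if name.startswith("content/") and name.endswith(".xml")
    )
    return has_metadata, tables


def build_demo_container():
    """Build a tiny in-memory stand-in for a SIARD file (not a valid archive)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("header/metadata.xml", "<siardArchive/>")
        archive.writestr("content/schema0/table0/table0.xml", "<table/>")
        archive.writestr("content/schema0/table0/table0.xsd", "<xs:schema/>")
    buffer.seek(0)
    return buffer
```

Because the container is plain ZIP and XML, it can be inspected with standard tooling and without the original database product.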
DBPTK: Database Preservation Toolkit
Set of tools to store relational databases in a standard archival format.
https://database-preservation.com
• DBPTK Desktop: desktop application to save a database to a preservation format, validate it, and browse and search the content
• DBPTK Enterprise: web application to browse and search the content of multiple large preserved databases
• DBPTK Developer: a command-line tool and development library for automation and system integration
DBPTK Desktop features
SIARD creation
Export database to a preservation format
• Connect to a local or remote database and save all content into a preservation format like SIARD
• Test connection will diagnose the most common problems and provide you with helpful hints to solve them
Supported DBMS:
• Microsoft Access
• Microsoft SQL Server
• MySQL / MariaDB
• Oracle
• PostgreSQL
• Progress OpenEdge
• Sybase
Migration report
Detailed report of migration changes and losses
• All export and selection parameters are presented.
• All column data type mappings to standard types are recorded.
• All compromises are documented.
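The kind of record a migration report keeps can be sketched in Python as follows. This is an illustration only: the `TYPE_MAP` entries, the fallback rule and the names `MappingEntry` and `build_report` are hypothetical and do not reproduce DBPTK's actual mapping or report format.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical mapping from vendor-specific column types to SQL standard types.
# The second element notes any compromise made by the conversion.
TYPE_MAP = {
    "TINYINT": ("SMALLINT", None),
    "DATETIME": ("TIMESTAMP", None),
    "TEXT": ("CHARACTER LARGE OBJECT", None),
    "MONEY": ("DECIMAL(19,4)", "currency semantics are not preserved"),
}

# Fallback for types the table does not know (an assumption for this sketch).
FALLBACK = ("CHARACTER VARYING", "unknown type stored as text")


@dataclass
class MappingEntry:
    table: str
    column: str
    original_type: str
    standard_type: str
    compromise: Optional[str]


def build_report(columns):
    """Record, for every (table, column, vendor_type), the standard type chosen."""
    report = []
    for table, column, vendor_type in columns:
        standard, note = TYPE_MAP.get(vendor_type.upper(), FALLBACK)
        report.append(MappingEntry(table, column, vendor_type, standard, note))
    return report
```

Keeping the original type next to the chosen standard type and any compromise lets a later reader judge what, if anything, was lost in the migration.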
Edit SIARD metadata
Enrich archived database with descriptions
● Add descriptions to database, tables and columns to better understand its contents.
SIARD validation
Validate archived database
● Validate SIARD against specification plus many additional checks for a thorough validation.
Search records
Browse and search database content
● Google-like search on the database content.
● Drill down on specific tables and do advanced search for specific fields to find exactly what you are looking for.
Auto-update
Automatic check for updates
● Stay up-to-date with automatic update check on startup and installation of new versions.
DBPTK Enterprise features
Enterprise architecture
For large institutions with many databases and users
● A web application that can be horizontally scaled to support many very large databases being accessed by many users.
Manage multiple databases
Single system, multiple databases
● Search through the databases, manage their status, enrich their metadata, validate them, make them ready for users to search.
Data transformation
Transform content to answer useful questions
● De-normalisation and table and column hiding, to simplify browsing and allow anonymisation of content
Data transformation (aka denormalisation)
Table: person
Name    | Birth      | City name     | Mayor | Country name
Mary    | 1986-03-28 | Payne Springs | Mary  | United States
Phillip |            | Rosenhayn     |       | United States
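De-normalisation of the person table above can be sketched with Python's built-in `sqlite3` module. The normalised schema (`person`, `city`, `country`), the key columns and the empty values for Phillip are assumptions made for illustration; the slide only shows the flattened result.

```python
import sqlite3


def build_demo_database():
    """Create a small normalised database in memory (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE country (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE city (
            id INTEGER PRIMARY KEY, name TEXT, mayor TEXT,
            country_id INTEGER REFERENCES country(id));
        CREATE TABLE person (
            id INTEGER PRIMARY KEY, name TEXT, birth TEXT,
            city_id INTEGER REFERENCES city(id));
        INSERT INTO country VALUES (1, 'United States');
        INSERT INTO city VALUES (1, 'Payne Springs', 'Mary', 1);
        INSERT INTO city VALUES (2, 'Rosenhayn', NULL, 1);
        INSERT INTO person VALUES (1, 'Mary', '1986-03-28', 1);
        INSERT INTO person VALUES (2, 'Phillip', NULL, 2);
    """)
    return conn


def denormalise_person(conn):
    """Follow the foreign keys so each person row carries city and country names."""
    return conn.execute("""
        SELECT p.name, p.birth, c.name, c.mayor, k.name
        FROM person AS p
        LEFT JOIN city AS c ON c.id = p.city_id
        LEFT JOIN country AS k ON k.id = c.country_id
        ORDER BY p.id
    """).fetchall()
```

The joined rows read like the flattened table on the slide: a user browsing the archive sees city and country names directly instead of numeric keys.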
Single sign-on
Support for multiple protocols
● LDAP, Active Directory, Database, SAML, ADFS, OAuth2, OpenID, Google, Facebook, Twitter, FIDO U2F, YubiKey, Google Authenticator, Authy, etc.
● Supports internal authorisation definition or configurable external authorisation
Browse and search
Allow users to access database content on the Web
● Allow them to search prepared, user-friendly and anonymised database content
Export features
Export data in tabular form
● Allow users to save search results in Microsoft Excel or other spreadsheet software format for easy analytics and diagrams
Activity log
Audit every access
● Who has done what, when and from where.
● Requirement for ISO 16363 certification.
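An activity log entry that answers who, what, when and from where can be sketched as below. The field names, the JSON-lines storage and the `record_access` helper are hypothetical and say nothing about how DBPTK Enterprise stores its log.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditEntry:
    user: str            # who
    action: str          # what
    timestamp: str       # when (ISO 8601, UTC)
    remote_address: str  # from where


def record_access(log, user, action, remote_address, now=None):
    """Append one immutable audit entry to `log` as a JSON line and return it."""
    now = now or datetime.now(timezone.utc)
    entry = AuditEntry(user, action, now.isoformat(), remote_address)
    log.append(json.dumps(asdict(entry), sort_keys=True))
    return entry
```

Entries are only ever appended, never edited, which is what makes such a log usable as audit evidence.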
DBPTK Enterprise & Desktop
Multiple languages supported
Interface translated into:
English, German, Estonian, Czech, Portuguese
Search stemming and stopwords support for:
English, Arabic, Bulgarian, Catalan, Czech, Danish, German, Greek, Spanish, Estonian,
Basque, Persian, Finnish, French, Irish, Galician, Hindi, Hungarian, Armenian, Indonesian,
Italian, Latvian, Dutch, Norwegian, Portuguese, Romanian, Russian, Swedish, Thai,
Turkish, Japanese (using morphological analysis), CJK bigram (Chinese, Japanese, and Korean languages)
DBPTK Developer features
Command line interface
Automation of periodic preservation tasks
● Command line interface allows easy automation of periodic tasks like saving database to preservation format, validating, and editing metadata.
Systems integration
Java library
● Library to allow integration of production systems to directly use database preservation features.
Open source
For custom development
● Code base that allows custom development of new features or specialised support for new or legacy database systems.
And many more features
For archiving databases:
• SSH Tunnel
• Selection of tables and columns
• Selection and materialisation of views
• Custom views
• External files (files stored outside the DB)
• External files via SSH tunnel
• Automated quality assurance
• Save LOBs outside SIARD file
• Migrate from SIARD to SIARD
• Migrate from SIARD to live DBMS
• Convert ORACLE geodata
For accessing archived databases:
• Configure visible tables
• Configure visible columns
• Set column name, description and order
• Binary columns advanced options
• REST API
• Load on access and auto-unload
How can DBPTK be useful for data creators?
To archive and provide access to:
• Legacy databases
• Legacy information systems that are supported by databases
• Production databases or systems (snapshots or incremental)
• To restore archived databases into modern database management systems
• To alleviate the load on production systems
Contact us
© European Union, 2017. All rights reserved. Certain parts are licensed under conditions to the EU. Reproduction is authorized provided the source is acknowledged.
More information at: https://database-preservation.com
See full webinar (#6) at https://ec.europa.eu/cefdigital/wiki/display/CEFDIGITAL/eArchiving+webinar+Series+2020
Video at https://youtu.be/D-MZS1vloWc?t=1973
Use Case
Norwegian National Health Archive
Stephen Mackey, Piql
Hanne Mari Hindklev, Norwegian Health Archives
National Health Archive Project
• A Digital Preservation System built to preserve both digitised and electronically created patient archives in perpetuity
• The HARI (Health Archive Register Index) keeps track of the journals and metadata in the Archive
• The standard for digitised material is as close to the electronically created material as possible
• A separate standard for a “submission index” that populates the metadata both in the register and in the SIP
• An EHR validation tool performs structural control and analysis of the EHR extract before the EPJ-SIP is created
● Virtualised onsite environment
● Automated deployment (Ansible, Ansible Tower)
● Test, QA and Production environments
● Expected throughput
  ○ 350,000 journals per year
  ○ 450 GB per day
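To get a feel for the expected throughput, the two figures can be converted into daily and per-second rates. The sketch assumes decimal gigabytes, a 365-day year and an even spread over 24 hours, none of which the slide states; under those assumptions the result is roughly 959 journals per day and a sustained rate of about 5.2 MB/s.

```python
JOURNALS_PER_YEAR = 350_000
BYTES_PER_DAY = 450 * 10**9  # 450 GB, assuming decimal gigabytes
SECONDS_PER_DAY = 86_400


def journals_per_day(per_year=JOURNALS_PER_YEAR, days=365):
    """Average number of journals per day, assuming an even spread over the year."""
    return per_year / days


def sustained_megabytes_per_second(bytes_per_day=BYTES_PER_DAY):
    """Average ingest rate in MB/s, assuming an even spread over 24 hours."""
    return bytes_per_day / SECONDS_PER_DAY / 10**6
```

Real ingest is likely to be bursty, so these averages are a lower bound for sizing rather than a peak requirement.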
Use Cases for a Health Archive
The Norwegian regulation envisions two possible use cases for the archive once it is built:
• provide records to next of kin in compliance with open information regulation
• harvest the vast amount of historical healthcare-related data within the archive for medical research
There is no limit on the age of the records that hospitals may present to the NHA, so submissions consist of both physical and electronic patient records.
NHA EPJARK and DPJARK
• Norwegian standards for extraction of Patient Medical Records from source EHR/EMR systems or digitisation of Journals
• Legislation defines the use cases for the archive
• The standards define the metadata (patient personal and clinical) that should be included in archival packages
• The standards present a taxonomy for archiving of Patient Medical Records (Case, Sub-case, Documents, File)
But,
• The standards used are not based on international medical (excluding ICD) or archiving metadata standards (e.g. METS, PREMIS, FHIR)
EPJ Submission Case Taxonomy
• Single Document
• Multiple Cases, Documents
• Multiple Cases, Sub-cases, Documents
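The Case, Sub-case, Document, File taxonomy can be modelled as a small recursive data structure. The sketch below is in Python, and the class and field names are hypothetical: they are not taken from the EPJARK standard or the eHealth1 specification.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class File:
    name: str


@dataclass
class Document:
    title: str
    files: List[File] = field(default_factory=list)


@dataclass
class Case:
    identifier: str
    documents: List[Document] = field(default_factory=list)
    sub_cases: List["Case"] = field(default_factory=list)  # a sub-case is itself a case


def count_files(case):
    """Count the files in a case, including those in nested sub-cases."""
    own = sum(len(document.files) for document in case.documents)
    return own + sum(count_files(sub_case) for sub_case in case.sub_cases)
```

The same structure covers the simplest submission (one case, one document) and the most complex one (cases with sub-cases and many documents), which is why a single recursive type is enough for the sketch.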
NHA Lessons Learnt
• The two big suppliers of EHR systems in Norway want to take part in defining the workflow
• A project has started to test both extracting EHRs and transferring the data
• EPJARK and the associated AVLXML standard need to be understood by the vendors of the EHR systems
• The EHR validation tool is important to avoid going into a loop
• Limits on the EHR extract need to be defined (maximum size in GB, number of patients, etc.)
eHealth1 Content Information Type Specification (CITS)
Defined in the CEF Telecom Call for Proposals 2019 as “… specifications for eHealth will be developed by the Activity. One specification will be based upon the Norwegian eHealth archives transfer format of patient journals (from provider EMR systems to a central health archive).”
eHealth1 Specification – Summary
• Builds on the Common Specification (CSIP) and package specifications (SIP, DIP, AIP) structures
• Uses NHA use cases as foundation
• Submission agreements are mandatory
• Extractions in Case/Sub-case/Document/File structure (from simple to complex) based on EPJ specification
• Makes allowance for encapsulated bitstreams (such as DICOM)
• Can be used in digitisation programs or for born digital extractions
• The specification does not consider extraction from centralised EHR systems or submission via CDAs, but this is a possible future enhancement
eHealth1 Specification – Metadata
• Extensible descriptive metadata model
• Builds on the Common Specification (CSIP) through use of METS (Metadata Encoding and Transmission Standard) and PREMIS (Preservation Metadata Implementation Strategies)
• Patient-centric: recommends use of the FHIR Patient resource
• Extensible clinical metadata: recommends use of FHIR resources such as Condition, AllergyIntolerance, Procedure, etc.
eHealth1 - Next Steps
• Software development
• eHealth1 SIP Creator tool (November 21)
• Pilot implementation of an eHealth archiving solution based on piql/NHA and E-ARK software
Panel discussion
Moderator: Carlota Bustelo
Gabinete Umbus SL
The European directive on open data and FAIR principles: Impact on long-term preservation of government and research data
The Directive on open data and the re-use of public sector information provides a common legal framework for a European market for government-held data (public sector information). Although it focuses on public sector information, its transposition into national law also interlinks with research data and the adoption of FAIR principles. This panel will bring together government officials and research communities to debate the impact of this directive on long-term data preservation across the EU member states.
Directive (EU) 2019/1024 of the European Parliament and of the Council of 20 June 2019 on open data and the re-use of public sector information
- Whereas statement #59
Member States should also facilitate the long-term availability for re-use of public sector information, in accordance with the applicable preservation policies
- Article 9. Practical Arrangements
Member States shall also encourage public sector bodies to make practical arrangements facilitating the preservation of documents available for re-use
After discussion, consultations and adoption as principles in 2016, the ‘FAIR Guiding Principles for scientific data management and stewardship’ were published in Scientific Data
https://www.panosc.eu/data/fair-principles/
Speakers
José Borbinha, INESC-ID, Lisbon University
Joy Davidson, Digital Curation Centre and University of Glasgow
Igor Kuzma, Statistical Office of the Republic of Slovenia
Andreas Rauber, Technical University, Vienna
Daniele Rizzi, European Commission – Unit G1: Data Policy and Innovation
Questions
1. In your opinion, what is the relevance of digital preservation for the re-use of open data?
2. What should be the role of archives and other memory institutions in the preservation and re-use of open data?
3. In your experience, do you find there is a need for different projects to converge on common standards? In the projects you are involved in, what steps should be taken to achieve this?
4. How can the eArchiving Building Block support the implementation of the Open Data Directive?