Strain Data Networks - cria.org.br · PDF filetaxonomy modeling Data quality ... • Does...

Strain Data Networks The speciesLink and SIColNet experiences Dora Ann Lange Canhos Centro de Referência em Informação Ambiental - CRIA


Page 1

Strain Data Networks

The speciesLink and SIColNet experiences

Dora Ann Lange Canhos

Centro de Referência em Informação Ambiental - CRIA

Page 2

Centro de Referência em Informação Ambiental

CRIA (Reference Center on Environmental Information) is a not-for-profit, non-governmental organization.

Its aim is to contribute towards a more sustainable use of Brazil's biodiversity through the dissemination of high-quality data and information generated by the scientific community.

Page 3

Focus for the GBRCN information system

• Linking existing systems – information exchange

• Quality – data validation

• Data content

• Data usability

Page 4

Strategies for data integration

1) Same software and database used by all providers

[Diagram: users search through a web server offering centralized search over data providers 1, 2 and 3]

Slide: Renato de Giovanni (CRIA)

Page 5

Strategies for data integration

1) Same software and database used by all providers

• Interesting solution if all providers agree to use the same system:

– Improvements benefit all participants.

– Shared costs.

– Good performance (although queries are run in the production database).

• Lack of freedom to make custom adjustments.

• Very difficult to accomplish if providers are already using their own management software (sometimes developed with considerable effort).

Page 6

[Diagram labels: Development team; Coordinating group; Specialists (>400); New Developments; HELP]

Page 7

[Diagram labels]

Taxonomists; coordination (JBRJ); development (CRIA)

Editing interface

Web interface

User control

Global corrections

RTF output for printing; distribution maps

Statistics interface

XLS spreadsheet output

Data cleaning interface

Logs and controls

Importing data from existing lists

Maintenance, correcting bugs

New implementations

Support to the coordination; support to taxonomists

Backups, backups, backups, backups …

External resources

Page 8

Strategies for data integration

2) Periodically export data to a central database

[Diagram: data providers 1, 2 and 3 export data in a standard format; users search through a web server offering centralized search]

Page 9

Strategies for data integration

2) Periodically export data to a central database

Examples:

• Common Access to Biological Resources and Information (CABRI):

– Began in 1999.

– 28 catalogues from European institutions (>100K records).

• 1st phase of the Brazilian network

Page 10

[Architecture diagram. Components: data providers (culture collections); users; administrative interface (updates); virtual catalog (queries); Perl & Apache; PostgreSQL relational database. Connections are labelled HTTP and SQL.]

Page 11

Strategies for data integration

2) Data providers periodically export data to a central database

• Good performance.

• Easier to implement.

• Queries are performed on potentially non-current data.

• Onus on providers to transform data into a common format and periodically export it.

• Experience with SICol: no updates.

Page 12

Strategies for data integration

3) Real time distributed queries

[Diagram: users reach a web server offering distributed search; data providers 1, 2 and 3 each run wrapper software and share a data standard & protocol]

Page 13

Strategies for data integration

3) Real time distributed queries

Examples:

• 1998 - 2003; North America; MaNIS, HerpNET, ORNIS & FishNet.

• REMIB (Red Mundial de Información sobre Biodiversidad): started in 1998; Mexico.

• Started in 2000; 9 major herbaria; 6 million records (80% databased).

Page 14

Strategies for data integration

3) Real time distributed queries

• Access to current data.

• Providers have more confidence and sense of control.

• Performance and scalability bottlenecks.

• Performance limited by the slowest data provider.

• Servers sometimes down, network problems.

• When data providers go offline their data become unavailable.
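
The trade-offs of real time distributed queries can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the implementation of any network named in these slides: the provider names, latencies, records, and the `query_provider` function are all hypothetical. It shows that a distributed search takes at least as long as its slowest responding provider, and that an offline provider's records are missing from the result.

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Hypothetical providers: simulated latency (seconds), online flag, records held.
PROVIDERS = {
    "provider1": {"latency": 0.05, "online": True, "records": ["strain A", "strain B"]},
    "provider2": {"latency": 0.30, "online": True, "records": ["strain C"]},
    "provider3": {"latency": 0.01, "online": False, "records": ["strain D"]},
}


def query_provider(name):
    """Simulate one real-time query; an offline provider raises ConnectionError."""
    provider = PROVIDERS[name]
    if not provider["online"]:
        raise ConnectionError(f"{name} is offline")
    time.sleep(provider["latency"])
    return provider["records"]


def distributed_search(names):
    """Query all providers in parallel; return (records, unavailable providers)."""
    records, unavailable = [], []
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        futures = {name: pool.submit(query_provider, name) for name in names}
        for name, future in futures.items():
            try:
                records.extend(future.result())
            except ConnectionError:
                unavailable.append(name)
    return records, unavailable


start = time.perf_counter()
records, unavailable = distributed_search(list(PROVIDERS))
elapsed = time.perf_counter() - start
print(sorted(records), unavailable, f"{elapsed:.2f}s")
```

Even with parallel requests, the elapsed time cannot drop below the 0.30 s of the slowest online provider, and "strain D" never appears because its provider is offline.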

Page 15

Strategies for data integration

4) Data harvesting

[Diagram: a data harvester collects data from data providers 1, 2 and 3; users search through a web server offering centralized search]

Page 16

Strategies for data integration

Examples:

4) Data harvesting

Page 17

Strategies for data integration

4) Data harvesting

• Good performance.

• Queries are performed on potentially non-current data.

• Difficult to implement if there are many protocols and data standards involved.

• It may be necessary to define a common (minimum) field set for storing data in the central database.
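
The idea of a common (minimum) field set can be sketched as follows. This is only an illustration: the field names, the per-provider mappings, and the sample records are hypothetical and are not taken from speciesLink or any other network. Each harvested record is translated into the common fields before it is stored centrally; anything outside the mapping is dropped.

```python
# Hypothetical common (minimum) field set for the central database.
COMMON_FIELDS = ("institution", "catalog_number", "scientific_name")

# Hypothetical per-provider mappings from local field names to the common set.
FIELD_MAPS = {
    "provider1": {"inst": "institution", "cat_no": "catalog_number", "species": "scientific_name"},
    "provider2": {"Institution": "institution", "Number": "catalog_number", "Taxon": "scientific_name"},
}


def to_common(provider, record):
    """Translate one provider record into the common field set.

    Fields outside the mapping are dropped; missing common fields become None.
    """
    mapping = FIELD_MAPS[provider]
    translated = {mapping[key]: value for key, value in record.items() if key in mapping}
    return {field: translated.get(field) for field in COMMON_FIELDS}


def harvest(sources):
    """Build the central store, keyed by (institution, catalog_number)."""
    central = {}
    for provider, records in sources.items():
        for record in records:
            row = to_common(provider, record)
            central[(row["institution"], row["catalog_number"])] = row
    return central


sources = {
    "provider1": [{"inst": "COL1", "cat_no": "001", "species": "Aspergillus niger", "notes": "x"}],
    "provider2": [{"Institution": "COL2", "Number": "77", "Taxon": "Bacillus subtilis"}],
}
central = harvest(sources)
print(central[("COL1", "001")]["scientific_name"])
```

The smaller the common field set, the easier it is for every provider to map to it, at the cost of losing provider-specific detail (here, the `notes` field) in the central database.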

Page 18

Data Model: DarwinCore

• Based on specifications developed by the Dublin Core Metadata Initiative. Can be seen as an extension of it for biodiversity data.

• Its latest version consists of a glossary of terms with definitions, examples, and commentaries, including how terms:

– are managed

– can be used

– can be extended for new purposes

• Designed to minimize the barriers to adoption and to maximize reusability in a variety of contexts.

Page 19

On the evolution of Darwin Core

[Timeline figure: ~20 different versions of Darwin Core between the first draft and the current terms]

• 2002, DwC first draft: 48 elements, XML tied to a protocol.

• 2009, DwC terms: 172 elements, generic.

• Versions shown in between (the timeline also marks 2003): DwC 1.0, DwC OBIS, DwC 1.21 MaNIS, DwC 1.25, DwC bnhm, DwC AKNS 1.32, DwC 2Plants, DwC paleo, DwC kbif, DwC kbif 2, DwC2 jrw030315, DwC2 full, DwC2 taxonomy, DwC2 geography, DwC2 taxongeog, DwC2 gazetteer, DwC 1.4, DwC 1.4 curatorial, DwC 1.4 geospatial.

• Note: ABCD has 970 terms.

Standardization required!

Page 20

Data exchange protocol - TAPIR

• TDWG Access Protocol for Information Retrieval.

• Integrates functionality from DiGIR and BioCASe.

• Completely independent of the data being exchanged: works with DarwinCore and ABCD.

• Official TDWG standard.

• Tools and documentation available.
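
As a rough illustration of what a protocol request might look like, the sketch below builds key-value-pair GET URLs. Everything here is an assumption to be checked against the TAPIR specification before use: the access point URL is made up, and the `op` parameter with the operation names `ping`, `capabilities`, `metadata`, `inventory`, and `search` reflects the specification's key-value-pair requests only as best understood here. The sketch only builds URLs and sends nothing over the network.

```python
from urllib.parse import urlencode

# Hypothetical access point; a real provider publishes its own URL.
ACCESS_POINT = "http://example.org/tapir.php/collection"

# Operation names assumed from the TAPIR specification; verify before use.
OPERATIONS = ("ping", "capabilities", "metadata", "inventory", "search")


def build_request(operation, **params):
    """Build a key-value-pair GET URL for one operation."""
    if operation not in OPERATIONS:
        raise ValueError(f"unknown operation: {operation}")
    query = {"op": operation, **params}
    return f"{ACCESS_POINT}?{urlencode(query)}"


print(build_request("ping"))
```

Because the request names an operation rather than a data model, the same request shape could serve a DarwinCore provider or an ABCD provider, which is the independence the slide refers to.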

Page 21

Choosing standards and protocols

• Choose from existing standards whenever possible:

• This can save you considerable time.

• Will likely avoid interoperability issues in the future.

• Seek compatibility with other initiatives.

• You can benefit from existing tools.

• You may get extra functionality/data.

• Data providers are the pillars of every network:

• Help them improve their data.

• Ensure that data remain curated at the source.

• Show them that data sharing promotes citation and usage, giving them credits and visibility.

Page 22

The speciesLink network

[Diagram labels: biological collection; primary data; descriptive data; nomenclature; taxonomy; modeling; data quality; maps; education; research; decision making]

Page 23

Points to consider

• Biological collections in Brazil

– A small number of “large” collections

– A great number of important small research collections

• Average characteristics

– Human resources: expertise in informatics (normally insufficient)

– Equipment and installations (normally insufficient)

– Connectivity (normally slow or unstable)

Page 24

Challenges

– Integrating primary data from all taxa, from distributed collections, using different software in diverse environments

– Integrating data from collections with low and/or unstable internet connectivity, using basic hardware and no computer expertise

– Maintaining full control over the data served to the network at the provider’s end

Page 25

Development parameters - architecture

– Collection’s routine must not be altered

• Practically any software is accepted (Excel, Access, Specify, Biota, Brahms, PostgreSQL, MySQL, …)

– Data provider must have full control over the data

• What is sensitive data, what is open and free

• Digitization strategy, data cleaning strategy

– Data provider must be fully acknowledged

– Connectivity problem must be overcome

– Network must be interoperable with international initiatives

Page 26

Network architecture

[Architecture diagram. Components: collection database; collection management system (SQL); spLinker (Java); mapping of data fields to the DarwinCore data model; filter for sensitive data; SOAP; cache node with SOAP server (PHP), PostgreSQL data repository, mirror and translator; DiGIR provider; portal; on-line database. Labels: free and open access to non-sensitive data; restricted access to flagged sensitive data.]

John Wieczorek

Museum of Vertebrate Zoology, UC Berkeley

Page 27

spLinker: software to send data to cache node

• Platform independent (Java)

• Connects to practically any database

• Offers full control over data

• Checks repository and only sends updates (low traffic)

• It is possible to filter sensitive data using regular expressions

[Diagram labels: management system; data; spLinker (Java); data repository]
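
Two of the behaviours listed above, sending only updates and filtering sensitive data with regular expressions, can be sketched as follows. This is not spLinker's code (spLinker is a Java tool and its internals are not described in these slides); it is a hypothetical Python illustration. The hash-based change detection, the field names, and the sample pattern are all assumptions.

```python
import hashlib
import json
import re

# Hypothetical rule: records whose locality mentions a protected site are sensitive.
SENSITIVE = {"locality": re.compile(r"reserve|protected", re.IGNORECASE)}


def is_sensitive(record):
    """True if any configured field matches its regular expression."""
    return any(
        pattern.search(str(record.get(field, "")))
        for field, pattern in SENSITIVE.items()
    )


def fingerprint(record):
    """Stable hash of a record, used to detect changes."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def pending_updates(local_records, repository_hashes):
    """Return non-sensitive records that are new or changed in the repository.

    repository_hashes maps record id -> fingerprint already stored remotely.
    """
    updates = {}
    for record_id, record in local_records.items():
        if is_sensitive(record):
            continue
        if repository_hashes.get(record_id) != fingerprint(record):
            updates[record_id] = record
    return updates


local = {
    "1": {"name": "Aspergillus niger", "locality": "Campinas"},
    "2": {"name": "Bacillus subtilis", "locality": "Biological Reserve X"},
    "3": {"name": "Candida albicans", "locality": "Manaus"},
}
repo = {"1": fingerprint(local["1"])}
print(sorted(pending_updates(local, repo)))
```

Record 1 is unchanged and record 2 is withheld as sensitive, so only record 3 is sent. Filtering on the provider's side, before anything leaves the collection, is what keeps control over sensitive data with the provider.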

Page 28

The development of the speciesLink network

[Growth chart]

• Launched Oct. 2002: 5,280 records

• Oct. 2005: 709,306 records

• Estimated growth: 1.7 million; 3.5 million

Page 29

Data Sharing

• Does not depend only on the will to share data

– It must be planned: adequate resources, expertise, infrastructure

– Must be organized: data models, controlled vocabulary, communication protocols, ...

– Must be easy or at least “doable”

– Must have a compatible data policy: free and open access to non-sensitive data

Page 30

speciesLink architecture

[Architecture diagram. Components: collections with spLinker (SOAP); collections with a DiGIR provider; cache nodes; data harvester (DiGIR/TAPIR); central repository (PostGIS); network manager; indicators; query interface; data cleaning; data analysis; reports; maps (mapcria web service, WMS); web site; TAPIR portal; TAPIR provider web service.]

Page 31

Lessons learned

• Adoption of internationally agreed standards and protocols is key.

• Support unlocking and sharing of data (make it simple and easy!).

• Enable data providers to have full control of their data, determining what can be openly shared and what is sensitive.

• Full credit and acknowledgement to the data providers at all levels!

• Data providers must see the benefit of participating in the network.

• Data flagging and data cleaning tools are key to support the identification of data inconsistencies.

• Stable and long term funding is necessary to ensure development and the persistency of open and free data networks (persistent repositories are critical; funding mechanisms need to be improved!).
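
Data flagging, as opposed to silent correction, can be sketched like this. The checks and field names are hypothetical examples chosen for illustration and are not the rules used by speciesLink's data cleaning tools. Each record gets a list of flags and is left unchanged, so any correction still happens at the source.

```python
def flag_record(record):
    """Return a list of flags describing inconsistencies; the record is not modified."""
    flags = []
    lat, lon = record.get("latitude"), record.get("longitude")
    if (lat is None) != (lon is None):
        flags.append("incomplete coordinates")
    if lat is not None and not -90 <= lat <= 90:
        flags.append("latitude out of range")
    if lon is not None and not -180 <= lon <= 180:
        flags.append("longitude out of range")
    if not (record.get("scientific_name") or "").strip():
        flags.append("missing scientific name")
    return flags


records = [
    {"scientific_name": "Aspergillus niger", "latitude": -22.9, "longitude": -47.1},
    {"scientific_name": "", "latitude": 122.9, "longitude": -47.1},
    {"scientific_name": "Bacillus subtilis", "latitude": -3.1},
]
for record in records:
    print(record.get("scientific_name") or "(no name)", flag_record(record))
```

A report built from such flags can be sent back to the provider, which fits the earlier point that data should remain curated at the source.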

Page 32

Putting the pieces together

[Diagram. Fiocruz - RJ: CLIOC, CBMA, ColTryp, CMT, CCFF, CCGB, CENT, CCBH, CLIST, CCAMP, CCBS, INCQS, CFP; μSICol; TAPIRLink]

Page 33

The pieces put together

[Diagram.

Fiocruz - RJ: CLIOC, CBMA, ColTryp, CMT, CCFF, CCGB, CENT, CCBH, CLIST, CCAMP, CCBS, INCQS, CFP; μSICol; TAPIRLink

Fiocruz - AM: CBAM, CFAM; μSICol

Unicamp - SP: CBMAI; μSICol

Embrapa: Infomicro

Connections: TAPIRLink; spLinker

Networks: speciesLink (DarwinCore2); SIColNet (DarwinCore2 + microbial)

Outputs: global catalogue; distribution maps; indicators; reports; data cleaning reports]

Page 34

Quality level ???

Page 35
Page 36
Page 37
Page 38

Quality Procedures ???

Page 39
Page 40
Page 41
Page 42
Page 43
Page 44
Page 45
Page 46

Making things possible

• Setting big goals, but…

– Step by step approach

– Problem driven approach

• Make it simple or at least doable

Page 47

Partners

National Partners

Sponsors & Funders

International Partners

Page 48

Thank you

Dora Ann Lange Canhos

[email protected]