Data

Charles Lang and Barnum Brown working in laboratory with skeleton of Tyrannosaurus rex.
Thane L. Bierwert/© AMNH

Collecting Data

The information that is associated with a fossil, which is known as specimen data, is almost as important as the fossil itself.

Specimen data can help confirm the identity and age of the fossil; provide information about the paleoenvironment of the site where the fossil was found; help other researchers find the site again, so that further collections can be made; guide curators, conservators, and preparators in making decisions about how best to treat the material; and provide information on the history of collecting. Without these data, the fossil is much less “valuable” to scientists than it would otherwise be. For this reason, collection managers spend almost as much time managing data as they do looking after the fossils themselves.

The Darwin Core Standard

Because so many institutions collect data about specimens, and want to share these data efficiently, there are now major efforts within the scientific community to reach agreement on the basic types of information that should be collected. This “data about data” is known as “metadata” and the list of agreed metadata for natural history specimens is called the Darwin Core standard (or DwC for short). The Darwin Core is defined as “a specification of data concepts and structure intended to support the retrieval and integration of primary data that documents the occurrence of organisms in space and time and the occurrence of organisms in biological collections.”
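As a rough illustration, a single specimen record keyed by Darwin Core term names might look like the sketch below. The term names (catalogNumber, scientificName, and so on) come from the standard, but every value is invented, and the short list of "required" terms is only an assumption made for this example, not something the standard itself mandates.

```python
# Keys are Darwin Core term names; all values are hypothetical placeholders.
record = {
    "institutionCode": "EXAMPLE",
    "catalogNumber": "EX 0001",
    "basisOfRecord": "FossilSpecimen",
    "scientificName": "Tyrannosaurus rex",
    "country": "Example Country",
    "stateProvince": "Example State",
    "county": "Example County",
    "recordedBy": "Collector A",
}

# Assumption for this sketch: the terms an institution decides every record must carry.
REQUIRED_TERMS = ("institutionCode", "catalogNumber", "basisOfRecord", "scientificName")


def missing_terms(rec, required=REQUIRED_TERMS):
    """Return the required terms that are absent or empty in a record."""
    return [term for term in required if not rec.get(term)]
```

A check like this can be run before records are shared, so that incomplete records are caught inside the institution rather than by outside users.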

Sharing data is an important form of collection access. Researchers need to get access to this information in order to make the most effective use of the collection, while members of the public rely on interpretations of the data to understand specimens that are on display. However, it’s important to remember that collections data are valuable proprietary materials. Institutions need to keep this information secure and develop policies about what information to release and under what conditions it should be shared.

Important Types of Data

Taxonomic data

This information relates to the identification of the fossil – what type of organism it is. This information often changes with time as the fossil is worked on by different researchers. The identification may become more detailed (e.g., down to genus and species, rather than some higher taxonomic level) or may change because of new evidence that emerges during research. It is important to remember that taxonomic identifications are only opinions; for this reason, taxonomic data include not just the currently accepted name of the specimen, but also a record of all the different identifications that have been made since a specimen’s discovery – this is called a “taxonomic history.” Scientists may return to the taxonomic history many years later to get insights into why a previous researcher came to a particular conclusion about a specimen.
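One way to picture a taxonomic history is as a list of identifications that is only ever added to, never overwritten. The sketch below is a minimal illustration; the class names, field names, and example values are all hypothetical, and it assumes the "current" name is simply the most recently dated identification.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Identification:
    scientific_name: str
    identified_by: str
    date_identified: date
    remarks: str = ""


@dataclass
class TaxonomicHistory:
    catalog_number: str
    identifications: List[Identification] = field(default_factory=list)

    def add(self, identification: Identification) -> None:
        # Earlier opinions are kept; a new one is appended alongside them.
        self.identifications.append(identification)

    def current(self) -> Optional[Identification]:
        # Assumption: the latest dated identification is the accepted one.
        if not self.identifications:
            return None
        return max(self.identifications, key=lambda i: i.date_identified)


history = TaxonomicHistory("EX 0001")
history.add(Identification("Theropoda", "Researcher A", date(1950, 1, 1)))
history.add(Identification("Tyrannosaurus rex", "Researcher B", date(1990, 6, 1),
                           remarks="Based on newly prepared skull material"))
```

Here `history.current()` gives Researcher B's identification, while Researcher A's earlier opinion remains available in `history.identifications`.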

Locality data

This is information about the place where the specimen was found – its locality. It usually includes not just the name of the locality, but also geographic information, such as the country, state, and county where the locality is situated. Sometimes there will be more detailed information, such as coordinate data; these may include latitude and longitude, UTM or GIS coordinates, or township and range information. There may be information that is intended to help find the site, such as a narrative description (e.g., “100 yards upstream from highway bridge, at base of bluff”), annotated maps, or aerial photographs. Locality data are particularly sensitive because of the need to protect fossil sites from illegal collecting, to protect the rights of landowners, or to ensure that sites remain secure from other researchers while the institution and its staff complete a multi-year research program. For this reason, it is a good idea to develop a policy that specifies the level of information that can routinely be provided in response to public enquiries – e.g., generally nothing more specific than county-level information.
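A county-level release policy of this kind can be sketched as a simple filter that copies only approved fields into the public version of a record. The field names and values here are hypothetical, and a real policy would likely have more levels of access than this.

```python
# Assumption: the policy allows nothing more specific than county-level data.
PUBLIC_FIELDS = ("country", "state", "county")


def public_locality(locality: dict) -> dict:
    """Return a copy of the locality holding only fields approved for release."""
    return {key: locality[key] for key in PUBLIC_FIELDS if key in locality}


# Hypothetical internal record; coordinates and directions stay in-house.
full_record = {
    "country": "Example Country",
    "state": "Example State",
    "county": "Example County",
    "latitude": 0.0,
    "longitude": 0.0,
    "directions": "100 yards upstream from highway bridge, at base of bluff",
}
```

Listing the fields that may be released, rather than the fields that must be hidden, means any new sensitive field added later is withheld by default.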

Stratigraphic data

This is information about the geological context of the specimen. It might include the geological age of the rocks in which the fossil was found, the name of the rock formation, or more specific, narrative information about the position of the fossil within the site (e.g., “2 meters below the purple layer”). Stratigraphic data are important because they allow researchers to relate the fossil and the site where it was found to other fossils and fossil localities locally, nationally, or internationally.

Contextual information

These are data about the immediate surroundings of the fossil – for example, was it found in association with other specimens, what was the orientation of the specimen within a level or quarry, or were there differences between the matrix surrounding the fossil and the rocks in the rest of the site. Contextual information can be very important in understanding the environment in which the organism lived and died, as well as the taphonomy or paleoecology of the specimen.

Provenance data

This is information about how the specimen was obtained. It could be the name of the person who collected the specimen and the date when they found it. Alternatively, it might relate to a gift or donation, or a specimen purchase. As well as being important for dealing with later questions of ownership (see Acquiring), provenance data can provide a link between the specimen and archival information like field notes, journals, or correspondence, which may be an important source of additional information about the fossil.

Treatment history

This is a record of what has been done to a specimen since its discovery in terms of preparation, sampling, or repair. Treatment histories are vital because they let future workers determine what changes have been made to the specimen and what features of the specimen may have been altered or lost. They may also reveal the source of problems that can arise, for example from the use of inappropriate materials, which can guide preparators or conservators in making repairs or treatments. A treatment history might also include information about photographs taken, CT images, laser surface scans, etc.

Legacy data

These are pieces of information that are no longer in active use, but may still be important to our understanding of the specimen. For example, if a fossil is given a new catalog number, it is important to keep a record of the old number – it may have been cited in a publication, or referred to in important historical correspondence. Collectors’ field numbers are important pieces of legacy data; they link the specimen to field notes, which may provide essential locality or stratigraphic data that have not been recorded elsewhere.

Sharing Data

What is primary versus secondary data?

Another way to think about specimen data is to consider who generates the information. The ultimate source of data about a specimen is the information first recorded by the collector in the field. This is known as primary data; there can be only one primary data source and all subsequent transcriptions of the information, such as catalog records, specimen labels, etc., are secondary data.

It is important to remember this distinction when doing work like cataloging, which involves transcribing primary information. You should always retain a “verbatim” copy of the information recorded by the collector; “correcting” spellings, or changing punctuation, may actually lead to you losing information that the collector was trying to communicate.
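In a database, this usually means storing two values side by side: the collector's text exactly as written, and a separate interpreted version. The following is a minimal sketch of that idea; the function name and the example label text are hypothetical.

```python
from typing import Optional


def transcribe(verbatim: str, interpreted: Optional[str] = None) -> dict:
    """Store the collector's text untouched next to an interpreted version.

    If no interpretation is supplied, only the whitespace is tidied in the
    interpreted copy; the verbatim copy is never modified.
    """
    if interpreted is None:
        interpreted = " ".join(verbatim.split())
    return {"verbatim": verbatim, "interpreted": interpreted}


# Hypothetical field label: the "(?)" may record the collector's own doubt,
# so it survives in the verbatim copy even though the interpretation drops it.
label = transcribe("2 m. below  purple layer (?)", "2 meters below the purple layer")
```

Searches and reports can then use the interpreted value, while anyone who doubts a transcription can go back to the verbatim text.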

Databasing: Storing Data

Many large institutions have their collection information in databases that are capable of storing not only specimen data, but also different types of associated media (e.g., specimen images), and which can make these available and searchable by the academic community and general public via the internet.

The choice, design, and building of a collection database is an enormous topic, far beyond the scope of this website. But because it is such an important topic, what follows is a quick overview of some of the basic things to keep in mind for any collection.

Data Standards

One of the first things to think about before embarking on building a database is what types of information will need to be stored in it. Obviously, if you already have a card or paper catalog, this will provide a set of fields to work from. However, if you are intending to share your data, then it’s important to make sure that you are collecting the same kind of information, and storing it in the same way, as other institutional collections.

Increasingly, museums and other institutions that hold natural history collections have been looking for ways to allow researchers and the public to use the internet to compare data from different collections. For this cross-collection searching to work effectively, different institutions need to be using the same core set of data types, known as a “data standard.” In recent years, the natural history collections community has been working to develop a standard for natural history specimen data, including fossils, which is known as the “Darwin Core.”

Data Modeling

Whether you build the database yourself, hire a designer to build it for you, or purchase a specialist collection database software package, you will need to define the different types of information that you will be collecting, how these data relate to each other, and how the database will store them. This process is known as “data modeling”.

At its most basic level, data modeling involves creating a definition for every field in the database: what information will be entered in the field; how long the field will be; whether the contents will be numbers, letters, or a combination of both; whether you will be able to type anything into the field, or choose options from a fixed list, etc. The resulting list of field definitions is called a “data dictionary.”
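A data dictionary can be written down in many forms, including a plain document or spreadsheet. As one possible sketch, the example below expresses three field definitions in code and uses them to check values before they are entered. The field names, lengths, and the fixed list of conditions are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class FieldDefinition:
    name: str
    description: str          # what information goes in the field
    max_length: int           # how long the field may be
    kind: str                 # "text" (anything) or "digits" (numbers only)
    allowed_values: Optional[Tuple[str, ...]] = None  # fixed list, if any


# Hypothetical definitions; a real dictionary would cover every field.
DATA_DICTIONARY = {
    definition.name: definition
    for definition in (
        FieldDefinition("catalog_number", "Unique number assigned at cataloging", 20, "text"),
        FieldDefinition("collection_year", "Four-digit year of collection", 4, "digits"),
        FieldDefinition("condition", "Overall condition, chosen from a fixed list", 10,
                        "text", ("good", "fair", "poor")),
    )
}


def validate(field_name: str, value: str) -> List[str]:
    """Return a list of problems with a value; an empty list means it is acceptable."""
    definition = DATA_DICTIONARY[field_name]
    problems = []
    if len(value) > definition.max_length:
        problems.append(f"{field_name}: longer than {definition.max_length} characters")
    if definition.kind == "digits" and not value.isdigit():
        problems.append(f"{field_name}: expected digits only")
    if definition.allowed_values is not None and value not in definition.allowed_values:
        problems.append(f"{field_name}: not in the fixed list")
    return problems
```

For example, `validate("condition", "good")` returns an empty list, while `validate("collection_year", "19o2")` reports that the value is not made up of digits.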

Once you have this, you can begin to think about how the different fields are related to each other: will it be a one-to-one relationship (e.g., a specimen can have only one catalog number), a many-to-one relationship (e.g., many different specimens may be collected from a single locality), or a many-to-many relationship (e.g., a specimen may be collected by more than one collector, and a collector may collect many specimens). These relationships will help you decide how many tables the database will need to best store your data, and how these tables should be linked together. The end result is known as the data model.
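The three kinds of relationship can be sketched as a small set of tables. The example below uses SQLite through Python's built-in sqlite3 module purely for illustration; the table names, column names, and sample rows are hypothetical, and a real collection database would need many more fields.

```python
import sqlite3

SCHEMA = """
CREATE TABLE locality (
    locality_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE specimen (
    specimen_id    INTEGER PRIMARY KEY,
    -- one-to-one: each specimen has exactly one catalog number, never reused
    catalog_number TEXT NOT NULL UNIQUE,
    -- many-to-one: many specimens may point at the same locality
    locality_id    INTEGER NOT NULL REFERENCES locality(locality_id)
);
CREATE TABLE collector (
    collector_id INTEGER PRIMARY KEY,
    name         TEXT NOT NULL
);
-- many-to-many: a linking table pairs specimens with collectors
CREATE TABLE specimen_collector (
    specimen_id  INTEGER NOT NULL REFERENCES specimen(specimen_id),
    collector_id INTEGER NOT NULL REFERENCES collector(collector_id),
    PRIMARY KEY (specimen_id, collector_id)
);
"""


def build_example_db() -> sqlite3.Connection:
    """Create an in-memory database with a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO locality VALUES (1, 'Example Quarry')")
    conn.executemany("INSERT INTO specimen VALUES (?, ?, ?)",
                     [(1, "EX 0001", 1), (2, "EX 0002", 1)])
    conn.executemany("INSERT INTO collector VALUES (?, ?)",
                     [(1, "Collector A"), (2, "Collector B")])
    conn.executemany("INSERT INTO specimen_collector VALUES (?, ?)",
                     [(1, 1), (1, 2), (2, 1)])
    conn.commit()
    return conn


def collectors_for(conn: sqlite3.Connection, catalog_number: str) -> list:
    """List the collectors linked to one specimen, in alphabetical order."""
    rows = conn.execute(
        """
        SELECT c.name
        FROM collector AS c
        JOIN specimen_collector AS sc ON sc.collector_id = c.collector_id
        JOIN specimen AS s ON s.specimen_id = sc.specimen_id
        WHERE s.catalog_number = ?
        ORDER BY c.name
        """,
        (catalog_number,),
    ).fetchall()
    return [row[0] for row in rows]
```

In this sketch the UNIQUE constraint enforces the one-to-one rule, so an attempt to insert a second specimen with the catalog number "EX 0001" is rejected by the database rather than relying on staff to notice the duplicate.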

Data modeling requires you to think carefully about your collection and how it is used. It can be a very tedious and time-consuming exercise, so there often is a temptation to skip it in the rush to get data digitized and rapidly available on-line. However, it’s no exaggeration to say that most of the problems that arise in database design projects come from inadequate or insufficient data modeling, so avoid this temptation.

Backing up Your Database

The data contained in a collections database are of paramount importance and so it is essential to have procedures in place to back up your data to protect against data loss or corruption. You should have a plan in place for backing up your database on a regular basis onto some media that, ideally, can be stored off-site. At its simplest level, backing up may involve copying your database to an external source.
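As a minimal sketch of a simple backup routine, the example below assumes the collection database is a single SQLite file and writes a date-stamped copy into a backup folder using the backup method of Python's sqlite3 module. The function name and file naming scheme are hypothetical; other database systems have their own backup tools, and the copy still needs to be moved to off-site storage.

```python
import sqlite3
from datetime import datetime
from pathlib import Path


def backup_database(source_path, backup_dir) -> Path:
    """Copy an SQLite database into backup_dir under a date-stamped name."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    # A timestamp in the file name keeps older backups from being overwritten.
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    target = backup_dir / f"{Path(source_path).stem}-{stamp}.sqlite"

    source = sqlite3.connect(str(source_path))
    destination = sqlite3.connect(str(target))
    try:
        # Connection.backup copies the whole database safely, even if it is in use.
        source.backup(destination)
    finally:
        destination.close()
        source.close()
    return target
```

It is worth opening a backup from time to time to confirm that it can actually be read; a backup that has never been tested may not be usable when it is needed.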

What is the difference between a flat file and relational database?

For small institutional and private collections, it may be sufficient to store information in a readily available spreadsheet program like Microsoft Excel or another flat file database, where all the information is kept in a single large table. The alternative is a relational database program, which allows powerful searches, cross-referencing of data, and association of specimen images and other media.

In a relational database, information is kept in various tables linked together by use of a common field, such as the specimen’s accession number or catalog number. The advantage to this approach is that data need to be entered only once. For example, locality or excavation information entered into one table can be linked to multiple specimens.
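The benefit of entering data only once can be seen even without database software. In the sketch below, plain Python dictionaries stand in for two linked tables; the identifiers and names are hypothetical. A misspelled locality name is corrected in one place, and every specimen linked to that locality picks up the correction.

```python
# "Locality table": each locality is stored once, keyed by its identifier.
localities = {"LOC-1": {"locality_name": "Example Quary"}}  # note the typo

# "Specimen table": each specimen stores only the link to its locality.
specimens = [
    {"catalog_number": "EX 0001", "locality_id": "LOC-1"},
    {"catalog_number": "EX 0002", "locality_id": "LOC-1"},
]


def with_locality(specimen: dict, locality_table: dict) -> dict:
    """Combine a specimen with the locality it links to, like a database join."""
    return {**specimen, **locality_table[specimen["locality_id"]]}


# The spelling is fixed once, in the locality table only.
localities["LOC-1"]["locality_name"] = "Example Quarry"

resolved = [with_locality(specimen, localities) for specimen in specimens]
```

In a flat file, the same correction would have to be made on every row that mentions the locality, with the risk that some rows are missed.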

For an individual collector or small institution, it might be enough to have your database on one computer terminal that is shared by everyone who needs access to the information. In larger institutions, however, it is essential that multiple staff members and researchers are able to access a collections database at the same time; this requires moving from a stand-alone system to a client-server model, in which multiple computers are networked. The end-user at a client terminal connects to the central server to access the database information.

This ensures that all users are accessing the same, up-to-date information. No matter what system you use, be sure to regularly back up the database and store copies in other safe places.

As mentioned in the section on Cataloging, each discrete unit of data in a database is a field. It is better to have fields be short and specific, allowing them to be searched most effectively. It is always possible to combine fields later if you need to, but it is much more time intensive to separate data down the road.
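The difference between combining and separating fields can be illustrated with a short sketch. The field values are hypothetical; the point is that joining separate fields is a one-line operation, while splitting a combined field back apart depends on guesses about punctuation.

```python
def combine(site: str, town: str, county: str, state: str) -> str:
    """Joining short, specific fields into one string is trivial."""
    return ", ".join([site, town, county, state])


# Four separate fields; the site description happens to contain a comma.
combined = combine("Bluff, north face", "Exampleville", "Example County", "Example State")

# Going the other way by splitting on commas yields five pieces, not four,
# and nothing in the text says which comma belongs inside the site description.
naive_parts = [part.strip() for part in combined.split(",")]
```

Sorting out cases like this across thousands of records generally requires someone to review them by hand, which is why it is better to start with separate fields.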

Should I buy a database system or develop one in-house?

For smaller institutions, developing your own database “in-house” may seem like an attractive proposition because of low start-up costs. However, it can be extremely time-consuming; even when the design and development phases are completed, there can be significant issues (especially money and personnel time) involved with maintaining the database. Contracting with a database designer to build the system can cut down on the time taken to set up the database, but your developer may be unable or reluctant to provide long-term support, thereby reducing the long-term effectiveness and viability of your system.

There are some commercial software packages that are aimed at both individual collectors and institutions. A certain degree of customization is often possible, especially for some of the larger and more costly programs, where it may be possible to tailor the software package to perform functions that are specific to the particular institution. For a more extensive discussion on how to select software, see the chapter on Computerized Systems in The New Museum Registration Methods. The Canadian Heritage Information Network (CHIN) also has extensive information on the steps necessary in planning and implementing a collections database.

What data should I put on the web?

Many databases now allow for easy publishing of data records onto the web, where they can be easily accessed by researchers and other interested parties. This type of data-sharing has great potential to stimulate research and innovative education projects, but care must be taken in terms of what information is made available, especially to protect fossil sites.

What kind of information should I be storing?

Potential fields can include, but are not limited to:

Object/specimen name
Description
Collection date
Collection site, town, county, state
Habitat/depositional environment, latitude, longitude, elevation, depth
Collector
Identified by and date
Cataloger and date
Condition
Value (at collection, current)
Dimensions and weight
Corresponding image/photograph numbers
Restriction on use, publication citation, reproduction

These Collection Management resources were originally developed in 2007 with the support of the National Science Foundation (NSF).