Scholarly metadata uses several serialization formats. Some formats such as
in the 1980s) and anvl require
a dedicated parser, something that should be avoided.
xml is mainly used by DOI registration agencies,
and parsing is more time-consuming and complex compared to metadata using
the strengths and weaknesses of
json (e.g. namespaces, but also added complexity).
yaml (used by cff) is a superset of
json and is easier for humans to write, but slower to parse than json
and not widely used for network calls, but rather for local files.
toml is an alternative to
but I haven't seen it used for scholarly metadata.
json is the preferred serialization format for scholarly metadata, and is widely used
for that, in particular in REST APIs.
xml is used mainly for historical reasons, primarily for
DOI registration (although DataCite has deprecated this in favor of JSON).
The scholarly metadata formats take various approaches to what data type (e.g. string vs. integer,
or list/array vs. dict/object/hash) they expect for a given field. Some are more flexible, in particular
Schema.org JSON-LD, but that flexibility comes at a price in the form of parsing overhead and breaking
tools. A related issue is the handling of None/null/nil, which adds flexibility for optional metadata
fields but again adds overhead. Type safety can facilitate metadata parsing and is increasingly adopted
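The parsing overhead of flexible field types can be illustrated with a minimal Python sketch. It assumes a Schema.org-style record in which `author` may be missing, a plain string, a single object, or a list of either. The helper names `wrap` and `author_names` are hypothetical and exist only for this example.

```python
from __future__ import annotations

from typing import Any


def wrap(value: Any) -> list:
    """Return value as a list: None -> [], list -> unchanged, anything else -> [value]."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]


def author_names(record: dict) -> list[str]:
    """Collect author names from a field that may be a string, an object, or a list of either."""
    names = []
    for author in wrap(record.get("author")):
        if isinstance(author, str):
            # Author given as a plain string
            names.append(author)
        elif isinstance(author, dict) and author.get("name"):
            # Author given as an object with a "name" key (assumed key for this sketch)
            names.append(author["name"])
    return names
```

Every consumer of such a record has to repeat this kind of normalization, whereas a format that always expects a list of objects lets the parser skip these checks.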
Legacy of Dublin Core
Dublin Core (and the formats heavily influenced by it such as
schema.org) have had an
important influence on scholarly metadata. Unfortunately some decisions made more than 25 years ago
haven't worked so well, and some of the newer metadata formats have made different decisions. Of course
all metadata formats sometimes made decisions that in hindsight didn't work so well (e.g. date-parts in
CSL-JSON or type
Crossref), so this is more a sign of the strong influence Dublin Core
had on the development of scholarly metadata, and the goal of being generic sometimes making specific things
One example is metadata that is only required for some resource types and is therefore not part of Dublin Core, e.g.
page numbers, or journal name and ISSN. This requires complex workarounds or leads to incomplete metadata,
as you see for example when you want to express DataCite DOI metadata in a formatted
citation. Another example
(which you see predominantly with DataCite) is to lump all related resources together, in this case using
relationType, and since 2021 also
relatedItem. This makes the easy difficult,
in this case listing all the references used in a resource, some of them using a DOI, some of them only having
Author names are a major source of complexity in scholarly metadata. Authors can be people or organizations,
and for people separate given and family names are needed for some scholarly metadata, e.g.
citations. Metadata from formats that only provide a single
name field for authors therefore need
to be processed, and this can be tricky (e.g. as the name for a person can be either
family, given or
given family). Commonmeta (just like Crossref and CSL-JSON) uses
name for organizations, and
type (which can be either
distinguish between the two.
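A minimal Python sketch of why processing a single name field is tricky. It assumes the input is already known to be a person rather than an organization. The function name `split_person_name` and the output keys `givenName` and `familyName` are assumptions made for this example, not taken from any particular schema.

```python
def split_person_name(name: str) -> dict:
    """Split a single name string into given and family name (heuristic only)."""
    name = " ".join(name.split())  # collapse repeated whitespace
    if "," in name:
        # "family, given" form: split at the first comma
        family, given = (part.strip() for part in name.split(",", 1))
    else:
        # "given family" form: treat the last word as the family name
        given, _, family = name.rpartition(" ")
    return {"type": "Person", "givenName": given, "familyName": family}
```

The heuristic handles both orderings for simple names, but it misparses multi-word family names written without a comma (`Ludwig van Beethoven` yields the family name `Beethoven`), and it cannot tell a person from an organization. This is why formats that store given and family names separately are easier to work with.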
Dates are another major topic of complexity in scholarly metadata. Scholarly resources can have multiple
dates associated with them (e.g.
publication_date), and each
date can be expressed with different levels of granularity (e.g. year-month-day, year-month, or year). Then
there are dates of different granularity (e.g.
Summer 2017), approximate dates, and date ranges.
Extended Date and Time Format (EDTF) and the underlying
ISO 8601 is the community standard to
cover these various dates and times, but is only partially supported by most metadata standards and tools.
date-parts to define partial dates and date ranges.
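As a rough illustration, the following Python sketch converts a CSL-JSON style `date-parts` value into an ISO 8601 string, keeping whatever granularity the input has and joining the two ends of a range with `/`. The function name `date_parts_to_iso` is hypothetical. The sketch assumes positive years and does not cover seasons or approximate dates.

```python
def date_parts_to_iso(date: dict) -> str:
    """Convert {"date-parts": [[year, month, day], ...]} to an ISO 8601 string.

    One inner list is a single (possibly partial) date; two inner lists are
    treated as the start and end of a date range.
    """

    def fmt(parts: list) -> str:
        year, *rest = parts
        # Zero-pad the year to 4 digits and month/day to 2 digits;
        # int() accepts both numbers and numeric strings
        return "-".join([f"{int(year):04d}"] + [f"{int(p):02d}" for p in rest])

    return "/".join(fmt(parts) for parts in date.get("date-parts", []) if parts)
```

For example, `date_parts_to_iso({"date-parts": [[2017, 6]]})` returns `"2017-06"`, and a two-element list such as `[[2017, 6], [2017, 8]]` becomes the interval `"2017-06/2017-08"`.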
Describing the type of scholarly resource is important, but unfortunately many vocabularies exist. As they all have their strengths and weaknesses and no existing vocabulary handles all resource types encountered in commonmeta, commonmeta is building an internal vocabulary to support as many mappings as possible.
Versioning has historically not been much of an issue with text publications, where at most we had a handful of
versions that could be sorted chronologically and one current version. With datasets and software this was
always different, with easily hundreds of versions that sometimes have complex relationships to each other.
This kind of versioning is best described by
isVersionOf relationships, and more work
is needed for commonmeta to address this. This topic is also relevant for preprints, where preprint and journal
article are published in different places and the relationship is not generated automatically.
Collections and Parts
This is again a topic that historically was not much of an issue with text publications, apart from the obvious cases: books can have chapters (possibly with different authors), and journals have issues, which in turn have articles. The topic is much more complicated with datasets or software, where the decision on how to split up the data or code is almost arbitrary. This is something commonmeta needs to support.