Some thoughts on 'General discussion of data quality challenges in social media metrics'


Zohreh Zahedi and Rodrigo Costas recently published a comparison of altmetrics data providers. Included in the comparison was Crossref Event Data, the service that I have been designing and building for the last couple of years. I am writing this blog post as a personal response to their study, “General discussion of data quality challenges in social media metrics: Extensive comparison of four major altmetric data aggregators”. We will also publish an official Crossref response, which I will link to when it is published.

I should first thank the authors for including Event Data in their comparison. The service is still in beta and relatively young. A year ago, when the sample data was collected, it was even younger. We hope to launch very soon.

Whilst Crossref Event Data was compared alongside altmetrics providers, I should make clear that Crossref and DataCite are not in the business of making metrics. The services that we were compared to, Altmetric.com and Plum Analytics, collect the same kind of data, but their ultimate aim is metrics. Our purpose begins and ends with collecting this underlying data so that anyone can analyze it. It could be used to make metrics, but it could be used for many other purposes besides.

Event Data is a service for the community, and will benefit from scrutiny, feedback and data from the community. It is my strong belief that this kind of infrastructure should be situated not behind closed doors but entirely in the open. Studies like this are vital for reliability and trustworthiness, and I welcome this analysis with open arms. Open data means open scrutiny and analysis, which in turn leads to greater quality, coverage and confidence in the data.

Zohreh gave me the opportunity to give feedback on an early draft, but I was not able to find the time until now. I personally apologise for that. Where I offer corrections on preventable inaccuracies, the fault is all mine!

This blog post discusses the specifics of Zahedi and Costas (2018), with a focus on its analysis of Crossref Event Data. But this is also an opportunity to reflect more generally on some of the issues around collection of this kind of data. I’ve written a companion blog post, Five principles for community altmetrics data. It connects ideas that have swirled around at conference talks and discussions over the past few years, and comes at the end of a Beta period as we bring Event Data to market. Hopefully the two blog posts balance specific response with general principles. I encourage you to read it first!

Opening challenges

The introduction opens with a discussion of the challenges.

… questions such as from where, when and how social media data has been collected and processed become critical in the development of reliable and replicable social media metrics research.

This is a very good starting point. Crossref is not in the business of creating metrics or conducting scholarly research. Instead, we want to support and enable this work by providing an open data pipeline for altmetrics data and running agents to collect data. All of our data is self-describing, meaning there is never any question about where, when and how a particular data point was collected.

We decided to supply data at the level that naturally answers these questions and, where possible, answers any further questions of provenance and processing that arise. For example, our Twitter data explains, as an integral part of the data, when it was processed, who processed it, and precisely which software was used.
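To make this concrete, here is a minimal Python sketch of fetching a single Twitter Event from the public API and printing the fields that answer “where, when and how”. The endpoint, parameters and field names reflect my reading of the Event Data User Guide at the time of writing; treat them as assumptions and check the current documentation.

```python
import requests

# Query the public Event Data API for one recent Twitter Event.
# Endpoint, parameters and field names are assumptions based on the
# Event Data User Guide; verify against the current documentation.
resp = requests.get(
    "https://api.eventdata.crossref.org/v1/events",
    params={"source": "twitter", "rows": 1, "mailto": "you@example.org"},
)
resp.raise_for_status()

for event in resp.json()["message"]["events"]:
    # Each Event is self-describing: it records what was observed,
    # when it was collected, and which Evidence Record backs it up.
    print("subject:        ", event.get("subj_id"))          # e.g. the tweet
    print("object:         ", event.get("obj_id"))           # e.g. the DOI
    print("occurred at:    ", event.get("occurred_at"))      # when the tweet happened
    print("collected at:   ", event.get("timestamp"))        # when we recorded it
    print("evidence record:", event.get("evidence_record"))  # how it was collected
```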

But it is one thing to say “this data point is supported by this evidence”. It is significantly more difficult to answer these questions of a metric in a useful manner. We have tried to make it as simple as possible to build easy-to-explain metrics with our data.

Choice of sources

Altmetrics is, by its nature, diverse. This study focuses on four data sources: Mendeley, Twitter, Facebook and Wikipedia. Of these, Crossref Event Data tracks two: Twitter and Wikipedia. These are all well-bounded, finite domains. There is a finite number of Tweets, Wikipedia articles, Mendeley and Facebook activities. Whilst they do not capture the diversity of altmetrics sources, they are very good as a basis for comparison between aggregators. As these are closed worlds, it is possible to ask “how much of this domain does a given service cover?”.

In contrast, it is impossible to talk about “all blogs”. We do not know about all the blogs that exist, we have no way of reaching them all, and they are running on a large number of platforms. I am particularly interested in tracking blogs, but they are not a good basis for comparison.

So, I think that the choice of sources is sound.

Choice of total as a proxy of coverage

The chosen method was to select a corpus of articles with DOIs, then compare each service based on the number of results that it returned for each DOI. I do not think this is a sound basis for comparing services. In my opinion the method has two problems:

Abstract citations vs reports of citations

Firstly, it’s important to separate out an abstract data point from the observation of that point. Each abstract data point (for example, a citation) is unique, but it may be observed, described and reported more than once, by different parties, and with varying degrees of accuracy. This is a familiar theme with citation databases. The same corpus of articles and citations is reported by Crossref, Scopus, Web of Science, Google Scholar, Microsoft Academic, OpenCitations and others. But, as countless bibliographic studies have shown, some data overlaps, some doesn’t. Event Data is a platform for reporting observations of links. In theory, each tweet-cites-an-article Event should be unique, but it is possible that we report each one more than once (for example, via our dedicated Twitter agent and our general Web agent). As we open our platform up to more Agents in the community, we may see this happening more often.
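As a rough illustration of the difference, here is a small Python sketch that collapses a list of Events (reports) down to unique subject-object pairs (an approximation of the abstract citations). The sample records are invented, and keying on the subject and object is just one possible notion of “the same citation”.

```python
# Illustrative only: these Events are made up, and collapsing on
# (subj_id, obj_id) is one possible way to identify "the same citation".
events = [
    {"subj_id": "twitter://status/1", "obj_id": "https://doi.org/10.5555/xyz",
     "source_id": "twitter"},
    {"subj_id": "twitter://status/1", "obj_id": "https://doi.org/10.5555/xyz",
     "source_id": "web"},  # the same tweet, observed again by another agent
    {"subj_id": "twitter://status/2", "obj_id": "https://doi.org/10.5555/xyz",
     "source_id": "twitter"},
]

observations = len(events)
abstract_citations = {(e["subj_id"], e["obj_id"]) for e in events}

print(observations)             # 3 reports of citations
print(len(abstract_citations))  # 2 underlying citations
```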

Citation models vary

There is a further difference between an Event and the idea of an abstract citation. Not all platforms have ‘stable’ citations. Notably, in Wikipedia they come and go. As an example, a given Wikipedia page could be edited four times over its history, meaning that there have been five versions. If a DOI was added in version four, then there will be two Events, one to say “Page A version 4 references article B” and another to say “Page A version 5 references article B”. We can interpret that a few different ways. We could say “Article B is currently referenced by Page A”. We could also say “Article B has been referenced by Page A for 2 out of the last 5 revisions”. Zahedi and Costas do mention that the increased Wikipedia figures for Event Data can be explained this way. But implicit in that statement is an acknowledgement that the data models differ substantially between sources, which means they cannot really be compared just by counting them.
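To show how much interpretation sits on top of the raw Events, here is a hedged sketch of the two readings above, applied to invented data. The shape of the records is deliberately simplified; real Events carry much more detail.

```python
# Invented, simplified records: one entry per (page revision, referenced DOI).
# Revisions 4 and 5 of "Page A" reference article B, matching the example above.
revision_events = [
    {"page": "Page A", "revision": 4, "doi": "10.5555/article-b"},
    {"page": "Page A", "revision": 5, "doi": "10.5555/article-b"},
]

total_revisions = 5   # the page has existed in five versions
latest_revision = 5

# Reading 1: "Article B is currently referenced by Page A."
currently_referenced = any(
    e["revision"] == latest_revision and e["doi"] == "10.5555/article-b"
    for e in revision_events
)

# Reading 2: "Article B has been referenced for 2 of the last 5 revisions."
revisions_with_reference = {e["revision"] for e in revision_events
                            if e["doi"] == "10.5555/article-b"}
fraction = len(revisions_with_reference) / total_revisions

print(currently_referenced)  # True
print(f"{len(revisions_with_reference)} of {total_revisions} revisions ({fraction:.0%})")
```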

Of course, you can get some limited useful data from counting Events. It can signal that we completely failed to find anything, or signal that a particular article has suddenly become popular. But beyond that, totals don’t provide detail.

You can’t tell overlap using totals

My second objection to using counts is that they do not take into account the individuality of the data points. Source A may have 5 data points and source B may also have 5.

The two sources may have exactly the same five data points, meaning that they captured the same data. Or they might have five entirely different points, and therefore have captured entirely different data. In these two extremes the difference is black and white: A and B are either identical or completely different. Using totals alone, you lose the ability to tell which.

In another example, Source C has 10 data points, D has 10 and E has 5. Looking only at totals, C and D are similar to each other, and E is dissimilar to both. But if E is a subset of D and has no points in common with C, then E and D are actually closer than the totals suggest, and C and D may be further apart than their matching totals imply. This could hint at, for example, overlap in data sources but divergence in processing. This insight is also missed when using only totals.

These are the kinds of questions I would like to see addressed. Do different altmetrics data providers overlap? Is there a common set of data points that they all captured? Are there specific types of data that one source gets right but another has difficulty with? We cannot hope to answer these unless we compare like-for-like.

To put it another way: We are comparing services that collect sets of data points. I think it is more worthwhile to compare the intersection of the sets rather than to simply compare their cardinalities.
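As a sketch of what a like-for-like comparison might look like, here is a small Python example comparing two invented sets of data points by overlap rather than by size. The identifiers are made up; in practice you would key on something like normalised (subject, object) pairs.

```python
# Invented data points for two hypothetical aggregators, keyed on some
# normalised identifier (e.g. a (tweet, DOI) pair).
source_a = {"t1-doiX", "t2-doiX", "t3-doiY", "t4-doiY", "t5-doiZ"}
source_b = {"t1-doiX", "t6-doiX", "t7-doiY", "t8-doiY", "t9-doiZ"}

# Totals alone say the sources look identical.
print(len(source_a), len(source_b))   # 5 5

# Comparing the sets themselves tells a different story.
overlap = source_a & source_b
union = source_a | source_b
print(len(overlap))                   # 1 shared data point
print(len(overlap) / len(union))      # Jaccard similarity: ~0.11
```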

Parameters of the data set

In the methodology section:

The data collection from all of the selected altmetrics data aggregators was done in exactly the same date: 2017 June 19th with the aim of minimising time effects in the data collection.

and from the results:

This lower coverage of tweets by Crossref ED can be related to the recent start of this service, which is still in its Beta version.

and later:

Also, the recent start of CrossRef ED may imply that they have started to collect tweets from their inception…

The Event Data service is a pipeline, representing each Event as an observation at a single point in time. The API allows you to narrow your query to Events collected in a particular time range. Our Twitter agent started collecting data in 2017, and did not collect any Tweets before that.
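For example, a study that wants to control for collection time can ask the API for Events collected in a given window, something like the sketch below. The parameter names (from-collected-date, until-collected-date) and the response keys are my recollection of the API and should be checked against the User Guide.

```python
import requests

# Ask for Twitter Events collected up to the study's collection date.
# Parameter names and response keys are assumptions; confirm them in the
# Event Data User Guide.
resp = requests.get(
    "https://api.eventdata.crossref.org/v1/events",
    params={
        "source": "twitter",
        "from-collected-date": "2017-02-01",
        "until-collected-date": "2017-06-19",
        "rows": 100,
        "mailto": "you@example.org",
    },
)
resp.raise_for_status()
message = resp.json()["message"]
print(message["total-results"], "events collected in that window")
```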

It is true that we have no data for Wikipedia and Twitter from before we started the service. To the person on the street, missing data is missing data. But studying a service’s ability to gather all data for all time is one question, and studying how completely it gathers data once it is running is another. They are both valid research questions, but I don’t think they should be conflated.

I agree that the question “how complete is the coverage?” is a useful question to most people. However, as the service becomes more mature and we cover a greater time period, that proportion should increase.

Quirks of Twitter data

All things digital are ephemeral, but Twitter is an extra special case. Authors of tweets may delete or hide them. Twitter has a social contract with its community of users, and a legal contract with people who consume its data: “Honor (sic) user intent”. When a tweet is deleted from Twitter it must be deleted from any service that collected it. Our Twitter compliance checker monitors our data and removes Twitter data from Events when necessary. The Event still exists, but it is marked as deleted and the metadata about the Tweet (ID and author) are removed. We think this is the right compromise.

The paper does cover this topic, but as Twitter is a focus, I think it should have played a more integral part in the analysis. After all, a large proportion, around 5%, of the Tweets that we gather are subsequently removed.

If you interpret Twitter data for the purpose of creating a metric, it is debatable whether or not deleted Tweets should count toward the total. That decision is up to the creators of the metric, and you could argue it either way depending on what you are trying to achieve. However, we are not trying to make a metric. Instead, we are trying to collect and distribute data in as much detail as possible.

We therefore don’t physically delete Events when the Tweet is deleted; we only remove the sensitive information. If you look simply at the total number of Events in Crossref for Twitter, it will never decrease. When Tweets are deleted, we simply lose the connection between the Event and the Tweet.
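This means that anyone building on our Twitter data has to decide, explicitly, what to do with deleted Tweets. Here is a hedged sketch of the two obvious choices, assuming (per my recollection of the schema, to be verified) that an Event carries a marker such as an updated field when the underlying Tweet has been removed.

```python
# Invented sample: two live Events and one whose Tweet was later deleted.
# The "updated" marker reflects my understanding of the schema; verify it
# against the Event Data User Guide before relying on it.
events = [
    {"obj_id": "https://doi.org/10.5555/xyz", "updated": None},
    {"obj_id": "https://doi.org/10.5555/xyz", "updated": None},
    {"obj_id": "https://doi.org/10.5555/xyz", "updated": "deleted"},
]

all_events = len(events)
live_events = sum(1 for e in events if e.get("updated") != "deleted")

print(all_events)   # 3: every observation we ever made
print(live_events)  # 2: only Tweets that still exist on Twitter
```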

Because of these differences in intention, simply comparing Tweet counts is only meaningful when the deletion policy is explicitly taken into account and quantified. This is not an easy task, as Twitter’s rules confound this kind of analysis (you can’t ask Twitter for deleted tweets, so you don’t know how many there are). I know of at least one study that does not follow their terms and conditions. Tempting though it is to break the rules, I cannot recommend it.

Changes since June 2017

The Event Data service has not changed conceptually since June 2017, but since then we have introduced a few supporting components. Whilst Evidence Records have been around for a long time, Evidence Logs have now become available. They make bulk analysis of our performance easy. Every day’s worth of activity is available in a single (large!) log file. This explains, in excruciating detail, everything that happened on that day, including all the times we found potential matches but they didn’t work out.

We have also changed our Wikipedia data model relatively recently. Previously we represented each conceptual Wikipedia article as a separate entity from its individual versions. We also represented the relationship between Wikipedia article versions. The new, simplified model captures the same data (each reference in each revision of each page), but in an easier-to-understand form. It is still a lot of detail, which means a lot of data.

Over the past couple of years we have experimented with different methods for Twitter compliance. We have now settled on a schedule for checking each Twitter Event for deletion. The result is still the same, but the behaviour is now more regular.

Lagotto History

No mention was made of Event Data’s history and, as it is being compared to Lagotto, it is worth pointing out. When we started the project a few years ago, we intended to use Lagotto. Through an experimental trial period, during which I worked very closely with Martin Fenner (who is largely responsible for Lagotto), we refined the data model. What we call an Event in Event Data began life as a Deposit in Lagotto. The Event Data schema is the Lagotto schema.

Lagotto was created by PLOS (originally as PLOS ALM) to collect article-level metrics for individual publishers (in the first instance, PLOS). As we trialled it at Crossref we realised that it didn’t align with our objective of collecting references instead of metrics, and the architecture was unsuitable for running as industry infrastructure. Our DOI metadata records individual references (rather than simple reference counts) and individual author IDs (not simply “how many authors”). In the same way, we wanted to collect individual Tweet IDs, not just counts.

I am very grateful for Martin’s work on Lagotto. He is now Technical Director at DataCite, and Event Data is a joint project between Crossref and DataCite.

Conclusion

The authors have done a great job in understanding a range of providers that all seem to do approximately the same thing, but all do it a little differently. Whilst I hope that future studies will compare sources in more detail, I am glad that they found a method that applies to the current commonalities of all of the providers.

Crossref Event Data is a new paradigm in an established field. The extant players in this field are vertically integrated (they collect their own data and do their own analysis), and our entry, like our other activities, deliberately focuses only on the infrastructure part of the chain.

Community data lives and dies by community scrutiny and uptake. We need people to conduct studies and analysis, and by making everything open we can help foster inquisitive altmetrics consumers and researchers.

There is a balance to be struck between giving people something that’s easy to use and something that’s precise but requires more effort to interpret. The length of the Event Data User Guide is testament to how much effort has been put into explaining how each data point is collected, but it also illustrates the complexity of the system. We would like to make the data, software and documentation “as simple as possible but no simpler”. I hope we’ve made progress since the study was started, but there is always room for improvement.

I hope this study is the first of many!
