The Data Access and Research Transparency (DA-RT) initiative is driving a wedge into the community of scholars. Admittedly, I cannot speak with great authority on this subject, having only recently finished my PhD. Nor am I a particularly method-driven scholar, even though I read methodological works with genuine interest. But since I recently signed a petition requesting a delay and further discussion of DA-RT, I feel compelled to record my stance on the matter in greater detail.
Essentially, DA-RT requires scholars to make cited data available at the time of publication through a trusted digital repository. The American Journal of Political Science (AJPS), for example, states in its guidelines on preparing replication files (p. 24) that all accepted manuscripts must deposit replication files in the Harvard Dataverse Network before the article enters the production stage. DA-RT has been lucidly analyzed by scholars far more established than I am (see, for example, Jeffrey Isaac’s editorial in Perspectives on Politics and his blog entry, as well as Rick Wilson’s blog entry on the subject). Nevertheless, in what follows I would like to share what I see as more practical deficiencies of DA-RT that have not yet been addressed, particularly concerning questions of public access to data.
There are three points that I would like to make as far as quantitative work is concerned. First, DA-RT will further increase the already large burden on reviewers, as they will have to check the replicability of findings – or face a backlash from the scholarly community if accepted manuscripts turn out not to be replicable. While AJPS has arranged for accepted manuscripts to be replicated by the Odum Institute for Research in Social Science at the University of North Carolina, it is doubtful that all other journals signing up to DA-RT will be able to outsource this step. Second, I cannot understand why journals like AJPS are endorsing specific repositories. Why not leave the choice to authors? Third, and most importantly, the liability question is the elephant in the room. Harvard’s Dataverse disclaims all warranties and liabilities in its terms of use. It does not even guarantee that its services are “free of viruses or other harmful components”. Would you call this a trustworthy repository? If harm comes to anyone from my material, who will bear the legal liability? Along similar lines, European scholars should be concerned about uploading their data to American servers, given the different privacy standards. An unintended consequence of DA-RT may be that European scholars are disadvantaged in presenting their research in American flagship journals.
DA-RT seems to have been drafted primarily with quantitative research in mind, as the above arrangement with the Odum Institute underlines. But its standards are being extended to qualitative research as well (see, for example, this newsletter of the APSA section for qualitative and multi-method research). Scholars relying on interviews or field notes, for example, may therefore be equally required to upload digital transcripts to an online repository. Again, I have three concerns. First, this would require researchers to prepare full transcripts in the first place. That imposes a significant additional workload and takes time away from areas where it might be more wisely spent, and not all of us will be able to hire research assistants to clear this hurdle. Second, if I conduct a series of one-hour interviews on Free Trade Agreements, spend ten minutes of each talking about negotiating directives, and finally publish an article on this particular aspect, will I be required to provide full transcripts or only the ten minutes of each interview that served as the data for this piece? Third, the liability question raised by digital repositories does not go away. Whether all of this will affect interviewees’ willingness to talk to scholars remains to be seen, but the concern cannot be dismissed out of hand.
None of the above should be read as an argument against data access and replicability in principle. What makes this debate so delicate is that everyone voicing concerns about DA-RT is suspected of trying to cover up bad research practices. Formally delaying DA-RT may be unwarranted: editors are experienced scholars who will use their discretion to tailor these standards to the needs of individual pieces of research. But let me end with two more general observations. First, no standard will prevent individual scholars from cheating if they really want to; DA-RT may not even raise the bar significantly for those who do. Second, DA-RT will require only small adjustments from scholars publishing in the journals subscribing to it, since much of what it includes is already best practice in top-quartile political science journals. In this sense, the debate that DA-RT stirs may be a boon because it focuses attention on the general principles underpinning “good” scholarly practice. But it is neither a panacea rooting out “bad” research, nor should we let a few rotten apples spoil our most valuable currency – the trust that all of us conduct our research in good faith. If disagreeing with DA-RT is interpreted primarily as an attempt to cover up flawed research, the process may turn out to be a bane for the community of scholars after all.
Update: An earlier version of this post erroneously linked to the terms of use of Harvard Web Publishing rather than Dataverse.
DA-RT doesn’t actually require any journal to adopt a policy of putting submissions through the sorts of replication efforts that AJPS currently does. See http://www.dartstatement.org/#!blank/c22sl – the language talks about journals expecting authors to clearly explain how the data were analyzed, e.g. via the provision of code in whatever application was used. AJPS’ policy is going above and beyond the DA-RT language – not inconsistent with it, but not required by it either.
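To give a sense of what the DA-RT language itself asks for, a replication file in that spirit might be no more than a short, self-documenting script. The sketch below is purely illustrative – the file name, variables, and model are hypothetical, and it assumes Python with pandas and statsmodels rather than whatever application an author actually used:

```python
# Hypothetical replication script in the spirit of the DA-RT language:
# each analysis step reported in the article is reproduced from the
# deposited data. File name, variables, and model are illustrative only.
import pandas as pd
import statsmodels.formula.api as smf

# Load the deposited dataset (illustrative file name).
df = pd.read_csv("replication_data.csv")

# Reproduce (hypothetical) Table 1: OLS of the outcome on the treatment
# with one control, exactly as described in the article's text.
model = smf.ols("outcome ~ treatment + control", data=df).fit()
print(model.summary())
```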
As for requiring authors to place data in specific repositories, e.g. AJPS’ Dataverse node, that’s presumably in part because it’s an easier way of increasing compliance with the policy than leaving it up to the authors, who may or may not follow up on whatever commitments they made to place the data somewhere. And note here that DA-RT doesn’t require specific repositories, per se, but specific categories of repositories. “Trusted digital repository” has a specific meaning for data archivists, who were involved in crafting what became DA-RT. See, e.g., http://www.icpsr.umich.edu/icpsrweb/content/datamanagement/preservation/trust.html for additional explanation. More generally, academics aren’t necessarily good data archivists, and placing data on a personal site is a decidedly imperfect way of sharing data. See, e.g., section 3 of “Creating Conflict Data” at http://conflictconsortium.weebly.com/standards–best-practices.html.
With regard to Dataverse and its liability language, I wouldn’t be surprised if that were boilerplate that a legal office somewhere required to be included. Read the terms of use for most data resources and you’ll often see some sort of disclaimer to the effect that the provider makes no guarantee as to the quality or accuracy of the data. The language can be a bit jarring, I agree, but it’s something legal offices routinely insist on. With regard to Odum’s Dataverse specifically, it does have a Data Seal of Approval – see http://www.odum.unc.edu/odum/contentSubpage.jsp?nodeid=688 for the meaning of that certification and what goes into achieving it.
As for the privacy considerations, what is the difference between having a de-identified data collection at ICPSR or in a Dataverse somewhere and having it at, say, GESIS (which is a great data archive)? I’m a bit unclear about this. Data-sharing policies routinely make exceptions for considerations pertaining to confidentiality, sensitive data, proprietary sources, etc., and will generally exempt authors from providing the actual data as long as there’s still clarity about how the data were collected, cleaned, and analyzed. DA-RT is no exception here, as it makes allowances for such considerations as well. In other words, if we’re talking about micro-level data on individuals or households or firms, any data made available to the public would be stripped of direct identifiers, would have indirect identifiers aggregated to reduce disclosure risk, and so on. Provision of such data is, in part, the reason for being of archives like GESIS, ICPSR, UKDA, etc. As long as the researchers have taken the appropriate steps here, how do the privacy considerations differ based on the geographic location of the archive?
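To make that concrete, here is a minimal sketch of the kind of preparation I mean; the column names, bands, and brackets are hypothetical, and it assumes Python with pandas:

```python
# Minimal de-identification sketch (hypothetical column names; assumes pandas).
# Step 1 strips direct identifiers; step 2 coarsens indirect identifiers so
# that individuals cannot easily be singled out.
import pandas as pd

def deidentify(df: pd.DataFrame) -> pd.DataFrame:
    # 1. Drop direct identifiers outright (names, contact details, etc.).
    direct_ids = ["name", "address", "email", "phone"]
    df = df.drop(columns=[c for c in direct_ids if c in df.columns])

    # 2. Aggregate indirect identifiers to reduce disclosure risk,
    #    e.g. exact age -> ten-year bands, exact income -> quartile brackets.
    if "age" in df.columns:
        df["age_band"] = pd.cut(df["age"], bins=range(0, 101, 10))
        df = df.drop(columns=["age"])
    if "income" in df.columns:
        df["income_bracket"] = pd.qcut(df["income"], q=4,
                                       labels=["Q1", "Q2", "Q3", "Q4"])
        df = df.drop(columns=["income"])
    return df
```

Whether steps like these suffice is exactly what archives such as GESIS or ICPSR assess before release, which is part of my point about trusting the archive rather than its location.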
Dear Robert,
Thank you for your comments and for giving me the opportunity to clarify some of my concerns.
On the question of replication before publication, I see your point. But when authors are required to submit their data for peer review, isn’t it fair to expect referees to actually try to replicate the findings? If non-replicable research still manages to appear in peer-reviewed journals, many benefits of DA-RT are lost (e.g., saving space for replicable studies). It is common knowledge that too many published articles are next to impossible to replicate (for various reasons). Hence AJPS rightly raises the bar in this regard, and other journals subscribing to DA-RT will have to clear it, too. My guess is that this is what will happen, as DA-RT journals will face increasing pressure to publish replicable research.
With regard to specific repositories, I stand by my initial point that there are various “trusted digital repositories”, even after applying the meaning data archivists attach to the term. There is no good reason for requiring authors to use Harvard’s Dataverse in particular. On the related aspect of liability, I would be troubled if Harvard really did use “boilerplate language” for its repository. This is not a minor issue, and if authors are required to use the repository, liability should not be laid squarely on their shoulders.
Finally, the point about privacy concerns merits closer inspection. Ingo Rohlfing has made a similar point on Twitter, essentially arguing that the data are public anyway. The problem, however, is that data may not be public right away, as an embargo period may be set before disclosure. This would be a problem for qualitative researchers as well, since interview transcripts may have to be uploaded. Perhaps this is a completely bogus point; I am not a privacy lawyer and may be completely off track here. But with the European Court of Justice (ECJ) having just struck down the Safe Harbor agreement with the US, I found it worthwhile to include this aspect in my post.
Thanks again for taking the time to comment in some detail on my concerns. I really do appreciate it and have benefited from your insight!
Sincerely,
Markus