This guide is for researchers, students and other scholars who want to learn how to find existing data to reuse for research. The guide is the result of a collaboration between Aarhus University Library (AUL), Copenhagen University Library (KUB) and Roskilde University Library (RUb), all part of the Royal Danish Library (KB). Note that access to certain sources mentioned in this guide (e.g. Web of Science Data Citation Index) may depend on your affiliation with a university or university library.
Data are the foundation of most research: the raw material, if you will. Many researchers create their own data through e.g. measurements, observations or simulations, but a good deal of research is done by reusing data from other researchers or data that have been collected in another context. This includes, for example, data from citizen science projects, previous research projects, public registries or commercial databases.
Data are shared in many different ways and can be found in many different places, including places that are not indexed in the same way as e.g. peer-reviewed articles. Searching for data therefore requires a different approach than searching for articles. In practice, a combination of strategies works best.
This guide introduces two distinct, yet complementary approaches to searching for data: a direct and an indirect search strategy.
The direct strategy involves targeted searches, focusing on specific keywords, databases, and repositories to locate relevant data. This method is ideal for researchers seeking precision and efficiency in their searches, and who know what they are looking for.
The indirect strategy invites researchers to explore broader contexts, delving into related literature, citations, and interdisciplinary sources. This method encourages serendipity (lucky finds) and requires a certain amount of detective work. A dead end does not necessarily mean that the data you are looking for do not exist; you may just have to find another route to them.
Throughout this guide, we will not only take a closer look at these strategies but also introduce resources for you to explore, from academic databases and archives to specialized research data repositories. The guide aims to equip you with the right tools to navigate the sea of information.
Aarhus University Library (AUL)
Find your liaison librarian at AUL
Copenhagen University Library (KUB)
forskerservice@kb.dk
Roskilde University Library (RUb)
forskersupport@kb.dk
Research data repositories: re3data
Published datasets:
AI-driven search algorithms, such as those found in tools like Google Scholar, often make it possible to ask for information in natural language. Some AI functionalities are built into existing tools, while others are stand-alone products. AI tools can help you define search strings or suggest databases and other sources of data. There are thousands of such tools. Some are free, while others come with a price tag. You can start looking for tools here: https://www.futurepedia.io/.
Searching for data is very different from searching for literature. There are no standardized search strategies, and you have to be prepared for a somewhat unstructured process when doing a data search. It resembles detective work.
Your approach ought to be flexible and comprise a combination of methods (see the tab on where and how to search). You should be open to where the search leads you. Searching for data thus involves an element of serendipity: you might find interesting data by chance. Expect dead ends and be ready to start over.
There are many approaches to searching for research data. Within certain fields you can find relevant datasets in field-specific repositories. In many cases, generalist repositories such as Zenodo are good sources as well. You can also search for suitable data repositories in registries such as re3data ("Registry of Research Data Repositories"). Datasets might also be attached to peer-reviewed articles or cited in scholarly literature. For more information about the different approaches (direct and indirect searches), see the tab on where and how to search.
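As an illustration of a direct keyword search in a generalist repository, the following is a minimal sketch that builds a search URL for Zenodo's public records API. The endpoint and the `q`, `type` and `size` parameters are given here as we understand Zenodo's REST API, so check the current API documentation before relying on them. The function name and the example keywords are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: Zenodo's public search endpoint for published records.
ZENODO_RECORDS_API = "https://zenodo.org/api/records"


def zenodo_dataset_search_url(keywords, size=10):
    """Build a Zenodo search URL limited to records of type 'dataset'.

    keywords: free-text search terms, as you would type them in the search box.
    size: number of results requested per page.
    """
    params = {"q": keywords, "type": "dataset", "size": size}
    # urlencode escapes spaces and special characters in the keywords.
    return f"{ZENODO_RECORDS_API}?{urlencode(params)}"


# Hypothetical example keywords; replace them with terms from your own field.
url = zenodo_dataset_search_url("citizen science biodiversity")
print(url)
```

Opening the printed URL in a browser, or fetching it with an HTTP client, should return the matching records as JSON. The sketch itself only builds the URL and does not contact Zenodo.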
Given the unstructured process of searching for research data, it might be rewarding to start your search in Google Dataset Search. Here, you can use simple keywords in a search engine that looks through a huge variety of data repositories. Another way to kickstart the search process is to make use of Artificial Intelligence (see box on the left).
Accessing data might also pose problems, in particular when they are not 'FAIR' (see box on the right). FAIR data should remain available in the long term and be retrievable through standard technical procedures (e.g. as a download). 'FAIR' does not necessarily mean that data are open and available to everyone, but that there is sufficient information on the conditions and procedures for accessing and retrieving them. FAIR data should be retrievable via a free and open protocol, and at least the metadata should be openly accessible.
Reusing data is only feasible when they are well-documented and carefully curated, with information about their origin and context (or more broadly: the data provenance). Data reuse can mean either accessing the data to validate the original study's findings (i.e. data reproducibility) or designing a new study on the basis of the original results (i.e. active data reuse).
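To make "retrievable through standard technical procedures" more concrete, here is a minimal sketch of how openly accessible metadata can be requested over plain HTTPS using DOI content negotiation. The media type `application/vnd.datacite.datacite+json` is an assumption that applies to DOIs registered with DataCite, and other registration agencies may offer different formats. The function name is hypothetical, and the DOI is taken from the reading list in this guide.

```python
from urllib.request import Request

# Assumption: media type for DataCite-formatted JSON metadata.
DATACITE_JSON = "application/vnd.datacite.datacite+json"


def doi_metadata_request(doi):
    """Prepare (but do not send) an HTTPS request for a DOI's metadata.

    The 'Accept' header asks the DOI resolver for machine-readable metadata
    instead of redirecting to the human-readable landing page.
    """
    return Request(f"https://doi.org/{doi}", headers={"Accept": DATACITE_JSON})


# DOI of the "How to FAIR" record cited in the reading list of this guide.
request = doi_metadata_request("10.5281/zenodo.3712065")
print(request.full_url)
print(request.get_header("Accept"))
```

Passing the prepared request to `urllib.request.urlopen` would send it over the network. If the DOI supports this format, the response body is a JSON metadata record that can be read without access to the data files themselves.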
Be aware of rights and restrictions. Not all data in a dataset might be Open Access, and you might have to contact a data owner or curator to request access. Furthermore, the extent to which you are allowed to use the data might differ from file to file. There can be many reasons for this, e.g. matters of sensitivity or confidentiality. In such cases, metadata are fundamental in that they can describe (the context of) the data without revealing sensitive information. Thorough metadata descriptions help to make the data FAIR. With good metadata, it might be possible to facilitate data sharing while ensuring the protection of individuals at the same time, for example.
A mantra of FAIR data, and Open Science in general, is to make data "as open as possible, as closed as necessary".
The FAIR principles focus on the preparation of research data so that others can find, understand, and reuse them. FAIR stands for Findable, Accessible, Interoperable, and Reusable. The FAIR principles provide guidelines for good data management that aim to keep data valuable for reuse in the long term. Adopting the principles can help to improve the entire process of research data management. Making research data FAIR enhances the visibility of one's research and can improve its reliability and reproducibility.
Read more:
Deutz, Daniella Bayle et al. (2020): How to FAIR: a Danish website to guide researchers on making research data more FAIR. doi: 10.5281/zenodo.3712065
Engelhardt, Claudia et al. (2022): How to be FAIR with your data: A teaching and training handbook for higher education institutions. Göttingen University Press. doi: 10.17875/gup2022-1915
Wilkinson, Mark D. et al. (2016): The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3:160018. doi: 10.1038/sdata.2016.18