Research data management

Definition

Research data is the "raw material" from which scientific research produces and justifies its results. To be considered scientifically sound, research results must be based on the analysis of primary or secondary data, whatever the scientific discipline.

The Organisation for Economic Co-operation and Development (OECD) defines research data as "factual records (numbers, text, images and sound), which are used as primary sources for scientific research and are generally recognized by the scientific community as necessary to validate research results" 6.

Research data is what makes scientific knowledge possible. It is the basis on which scientific evidence is established.

Research data:
  • Database
  • Code, algorithm
  • Sketches (drawings, sketch notes)
  • Data mining results
  • Text document
  • Audio recording
  • Video recording
  • Geospatial data
  • Measurements
  • Physical object (e.g. artwork, prototype, textile)
  • Photograph
  • Protocol
  • Questionnaire
  • Report
  • Survey
  • Statistics
  • Transcription

Not considered research data:
  • Preliminary analysis
  • Publication appendix (e.g. graphic or image)
  • Personal communications with colleagues
  • Peer reviews
  • Project administration file
  • Plans for future work
  • Drafts of scientific documents
  • Text from a publication

Type of data

The CNRS's INIST defines five types of data 4:

Type of data | Definition | Examples
Observational data | Data gathered in real time, usually unique and therefore impossible to reproduce | Questionnaires, interviews, neuroimaging, photography
Experimental data | Data obtained from laboratory equipment, often reproducible but sometimes costly | Chromatograms, DNA chips, trials
Computational or simulation data | Data generated by computer or simulation models, often reproducible if the model is properly documented | Meteorological data, earthquake simulation data
Derived or compiled data | Data derived from the processing or combination of "raw" data, often reproducible but costly | Text mining, MRI imaging, compiled databases
Reference data | Collection or accumulation of small datasets that have been peer-reviewed, annotated and made available | Gene databases, old image databases, archive collections

It is important to bear in mind that data is never "given" but rather "obtained" 3 through processes involving humans and/or machines. Temperature may be data for meteorologists, but what they actually process are factual records obtained through a transmission chain of sensors, transmitters and receivers. Likewise, an ancient text may be data for a historian, but the historian first had to discover its existence through research, obtain authorization to access it, scan it, translate it, reconstitute it, and so on. In the same way, physicists and psychologists obtain data from experiments, sociologists from questionnaires and interviews, geographers from photographs and maps, archaeologists from excavations and the dating and classification of samples, and epidemiologists from laboratory analyses (e.g. test results).

Data is therefore information that has been produced by a methodological process involving human and non-human agents. Every scientific discipline benefits from reflecting on its own modes of data production, so as not to confuse data with the reality it seeks to capture.
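The idea that data is obtained through a chain of steps can be made concrete by recording that chain alongside each value. The sketch below is only an illustration in Python: the `Observation` class, its field names and the sample temperature reading are all hypothetical and do not come from any particular standard or from the sources cited here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Observation:
    """A value together with the steps that produced it (hypothetical structure)."""

    value: float
    unit: str
    provenance: list[str] = field(default_factory=list)

    def passed_through(self, step: str) -> Observation:
        # Return a new observation with one more step recorded in the chain.
        return Observation(self.value, self.unit, self.provenance + [step])


# Illustrative reading: the number is made up for the example.
reading = (
    Observation(21.4, "degC")
    .passed_through("sensor")
    .passed_through("transmitter")
    .passed_through("receiver")
)

print(" -> ".join(reading.provenance))
```

Keeping the chain with the value is one simple way to avoid confusing a record with the phenomenon it describes; real projects would typically document provenance in far more detail.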

Data characteristics

Primary or secondary

When a research protocol produces its own data, that data is referred to as primary data. But not all scientific research produces its own dataset before carrying out its analysis. Research data may instead be supplied by other teams who have shared their data on data repositories, or by third-party organizations responsible for building databases (e.g. national observatories). Data is called secondary (or second-hand) when research teams use and analyze data they did not produce themselves.

Formatted and grouped

Research data needs to be processed so that it can be read, understood, contextualized and linked together. Once formatted and grouped together in the same space, it forms a corpus or dataset. Only once it has been assembled can it be analyzed, since demonstrating a scientific correlation and/or establishing evidence relies on finding and analyzing recurring patterns.
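As a minimal sketch of this formatting and grouping step, the Python example below brings records from two sources into one shared format, assembles them into a single dataset, and only then counts patterns across them. The two surveys, their field names and the `normalize` helper are hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical raw records from two sources with different field names.
survey_a = [
    {"respondent": "a1", "answer": "Yes"},
    {"respondent": "a2", "answer": "no"},
]
survey_b = [
    {"id": "b1", "response": "YES "},
    {"id": "b2", "response": "yes"},
]


def normalize(record_id: str, answer: str, source: str) -> dict:
    # Put every record into one shared format and keep track of its origin.
    return {"id": record_id, "answer": answer.strip().lower(), "source": source}


# Group the formatted records into a single dataset (corpus).
dataset = [normalize(r["respondent"], r["answer"], "survey_a") for r in survey_a]
dataset += [normalize(r["id"], r["response"], "survey_b") for r in survey_b]

# Only after assembly can patterns be counted across all records.
counts = Counter(row["answer"] for row in dataset)
print(counts)
```

Without the normalization step, "Yes", "YES " and "yes" would be counted as three different answers, which is why formatting comes before analysis.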

Sensitive

Some personal data may be classified as sensitive and require special precautions to ensure that its use does not harm individuals (see "Personal data" in the glossary). This includes health data.

Integrated

Obtained as part of the implementation of a research protocol, data is in itself a scientific product that has both a value (scientific, historical, but also commercial) and a certain level of confidentiality (sensitive information, intellectual property). It is therefore imperative that data is stored securely, and that its sharing is regulated so that it is not misappropriated for political or commercial purposes.

FAIR

In the context of Open Science policies, data must be FAIR, i.e. easy to find (Findable), accessible (Accessible), interoperable (Interoperable) and reusable (Reusable). To designate all the operations involved in formatting, recording and sharing data in compliance with Open Data policies, we now speak of "FAIR data" or "data FAIRization". For more information, see the FAIR principles 2.
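To give a rough idea of what FAIRization can involve in practice, the Python sketch below checks whether a dataset's metadata record contains a few fields loosely associated with each of the four principles. The field list is an assumption made for this example and is much simpler than the actual FAIR principles; the `missing_fair_fields` helper and the sample record are hypothetical.

```python
# Hypothetical mapping from each FAIR principle to the metadata fields
# that this sketch treats as minimal evidence for it.
FAIR_FIELDS = {
    "Findable": ["identifier", "title"],
    "Accessible": ["access_url"],
    "Interoperable": ["file_format"],
    "Reusable": ["license"],
}


def missing_fair_fields(metadata: dict) -> dict:
    """Return, per principle, the fields that are absent or empty."""
    gaps = {}
    for principle, fields in FAIR_FIELDS.items():
        absent = [name for name in fields if not metadata.get(name)]
        if absent:
            gaps[principle] = absent
    return gaps


# Illustrative record: every value here is a placeholder.
record = {
    "identifier": "doi:10.0000/example",
    "title": "Example survey dataset",
    "access_url": "https://repository.example.org/datasets/1",
    "file_format": "text/csv",
    "license": "",
}

print(missing_fair_fields(record))
```

Here the record has no license, so the check reports a gap under "Reusable". A real assessment would rely on the repository's own metadata schema rather than on this simplified list.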

Quantitative or qualitative

Depending on the discipline, data can be quantitative (coded data in large quantities) or qualitative (observational data, speeches and texts requiring interpretation). Although these two types of data can be used in a complementary or interconnected way (for example in grounded theory), they fall under two distinct methodologies:

  • Quantitative data is based on a "hypothetico-deductive" method: the research hypothesis precedes the production of data, and the aim of data analysis is to confirm or refute the working hypothesis. This type of data is primarily used by experimental sciences.
  • Qualitative data is based on an "empirico-inductive" method: the hypothesis is developed and refined during the data production and interpretation phase. This type of data is primarily used by human and social sciences.

References

  1. Delamadeleine, C. (2023). Guide rapide de la gestion des données de recherche (p. e0230416). HES-SO. https://www.hes-so.ch/fileadmin/documents/HES-SO/Documents_HES-SO/pdf/open-science/liens-utiles/Brochure_Guide_Rapide_V20240214.pdf
  2. GO-FAIR (2022). FAIR principles. https://www.go-fair.org/fair-principles/
  3. Fournier, T. (2014). Les données de la recherche : définition et enjeux. Arabesques, 73, 4-6. https://dx.doi.org/10.35562/arabesques.985
  4. Institut de l'information scientifique et technique. (2014). Une introduction à la gestion et au partage des données de recherche. https://www.inist.fr/wp-content/uploads/donnees/co/Donnees_recherche_web.html
  5. Latour, B. (1996). Petites leçons de sociologie des sciences. La Découverte.
  6. Organisation de Coopération et de Développement Économiques (2021). Recommandation du Conseil concernant l'accès aux données de la recherche financée sur fonds publics. https://legalinstruments.oecd.org/fr/instruments/OECD-LEGAL-0347
  7. IRD Data (2022). Les données de la recherche. https://data.ird.fr/gerer/quelles-donnees/#Les_types_de_donnees