Accepted answer

"Data" are any set of values belonging to some corresponding set of variables. In an empirical context, these are almost always either observations of some kind ("raw" data) or values which have been derived from observation in some way. All data are "real" (even simulated data), but not all data are equally useful.

The first essential thing to understand is that measurement is inherently uncertain. Consequently, even "raw" data are subject to various forms of measurement and sampling error, which may include bias introduced by factors outside of the observer's control and/or awareness. Well-intentioned scientists using badly biased samples will likely misestimate the parameters that are crucial to their investigations. These biases may be discernible in the raw data.

That said, almost all data processing entails some loss of information (i.e. some increase in uncertainty). Even though information from every datum is used to calculate (for example) the mean, there is no way to recover those data if only the mean is reported. This not only prevents new analyses from being done, but also potentially conceals the signs of bias that might have provided important clues about why a past study yielded bizarre results.
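To make the point about the mean concrete, here is a minimal sketch in Python. The two samples are invented purely for illustration and the variable names are hypothetical. Both samples have the same mean, yet only the raw values reveal that one of them is dominated by a single extreme reading.

```python
from statistics import mean, median

# Hypothetical raw measurements, invented purely for illustration.
sample_a = [9, 10, 10, 10, 11]   # readings clustered around 10
sample_b = [8, 8, 8, 8, 18]      # four low readings and one extreme value

# Summarizing each sample by its mean discards the individual values.
mean_a = mean(sample_a)
mean_b = mean(sample_b)

# With the raw values in hand, the difference between the samples is visible.
median_a = median(sample_a)
median_b = median(sample_b)
range_a = max(sample_a) - min(sample_a)
range_b = max(sample_b) - min(sample_b)

print("means:  ", mean_a, mean_b)
print("medians:", median_a, median_b)
print("ranges: ", range_a, range_b)
```

A reader given only the two means could not tell these samples apart, and could not compute the medians or ranges after the fact.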

Because the purpose of sharing datasets is to let other researchers (1) verify that the analysis was done correctly and (2) perform new analyses that yield new interpretations and discoveries, it is important that submitted data not be too heavily transformed or summarized. Thus, best practice is to report values that are as close as is reasonable to the original measurements, so that the original character of the data can be ascertained and so that the choice of statistical analysis is left to the analyst, rather than having been constrained or determined by the original researcher.

If, in practice, the derived data are widely interesting and/or particularly difficult to calculate (e.g. derived values estimated by brute-force calculation using a supercomputer), then they should be reported as well. Conversely, if the raw data are truly massive, they may be too expensive or unwieldy to be made conventionally available. While these are special cases that must be considered individually, they do not undermine the general principle that honest data sharing entails giving third parties the option of performing your entire original analysis over again, starting from your original measurements, if they so choose.
