This is a great introduction to these ideas; I have just assigned it to my students and will be requiring my technicians to read it. One thing to consider is mentioning how these principles can be applied at the data collection stage, which will hopefully widen your audience even further. The idea of "no empty cells" is especially relevant during data collection - I have spent hours of my career staring at empty cells on old datasheets, wondering why nothing was put in them - could the data not be taken? Did they forget? ... Again, thank you for this contribution.
This is a very useful guide to practice, but here are some suggestions for consideration:
- A rectangular data layout is recommended, with "rows corresponding to subjects and columns corresponding to variables". A more general statement would be that rows should correspond to "observations" rather than subjects. A subject may contribute many observations, and different research fields have units of analysis other than "subjects".
- The recommendation is made to store copies of the data in .csv files. If the data is required to be in .csv format for analysis, then ideally that is also the format it should be in for data entry within the spreadsheet environment; i.e., the spreadsheet can open the .csv file and save back to .csv format. Having two copies of the same data makes it easy for conflicts to arise between them: each time a change is made in the native file, one must remember to export the corresponding .csv file.
The alternative is to work with the native spreadsheet format and to use the appropriate API or import filter within the analysis environment to directly import that native format. In either case, having a single canonical data file reduces the possibility of data conflict. In particular there are now reliable packages in R for importing Excel files and online Google Sheets directly (such as readxl and googlesheets).
- More could be made of the advantages of working with online rather than file-based spreadsheets. Google Sheets makes the data available via the internet to all users who have been granted permission to view or edit it. Thus simultaneous analyses can be conducted, without regard to what network or particular computer the user is on. It also allows simultaneous multi-user access, with the ability to communicate directly with other users via a messaging system within each document. Google Sheets also provides an indefinitely long change history, so that individual errors and alterations can be reverted. By contrast, with native file-based formats, one generally relies on file-level version control, so that changes might only be revertible in total to the last time the file was backed up (version control software generally provides no more granular control of such non-text-based files). Within file-based spreadsheets, the action-by-action ability to undo changes generally persists only for the duration of a session. That is, unlike with Google Sheets, changes from a previous session cannot be reverted when a file is re-opened.
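To make the first suggestion above concrete (rows as observations rather than subjects), together with the "no empty cells" principle, here is a minimal Python sketch using only the standard library. The column names, the sample values, the `wide_to_long` helper and the `NA` missing-value code are all hypothetical and chosen for illustration; they are not taken from the paper.

```python
import csv
import io

# Hypothetical wide layout: one row per subject, one column per visit.
# Subject B has an empty cell for the second visit.
WIDE_CSV = """subject,weight_visit1,weight_visit2
A,61.2,60.8
B,72.5,
"""


def wide_to_long(text, id_column="subject", missing="NA"):
    """Return one dict per observation (one subject x one measurement column)."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        for column, value in record.items():
            if column == id_column:
                continue
            cell = (value or "").strip()
            rows.append({
                id_column: record[id_column],
                "measurement": column,
                # Record an explicit missing-value code instead of an empty cell.
                "value": cell if cell else missing,
            })
    return rows


long_rows = wide_to_long(WIDE_CSV)
for row in long_rows:
    print(row)
```

Each subject now contributes one row per measurement, and the missing second visit for subject B is recorded explicitly rather than left blank. Note that a code such as `NA` only says that a value is absent; why it is absent (not taken, forgotten) would still need its own notes column.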