Learning objectives
- Understand motivation for code and data management
- Know how to organize code, data, and results
- Be able to create and use an RStudio project
The scientific process is naturally incremental, and many projects start life as random notes, some code, then a manuscript, and eventually everything is a bit mixed together.
Managing your projects in a reproducible fashion doesn’t just make your science reproducible, it makes your life easier.
— Vince Buffalo (@vsbuffalo) April 15, 2013
A good project layout will ultimately make your life easier:
Although there is no “best” way to lay out a project, there are some general principles to adhere to that will make project management easier:
This is probably the most important goal of setting up a project. Raw data should never be edited, because you can never be sure that you will want to keep any edit you make, and you want to have a record of any changes you make to data. Therefore, treat your raw data as “read only”, perhaps even making a raw_data
directory that is never modified. If you do some data cleaning or modification, save the modified file separate from the raw data, and ideally keep all the modifying actions in a script so that you can review and revise them as needed in the future.
Anything generated by your scripts should be treated as disposable: it should all be able to be regenerated from your scripts. There are lots of different ways to manage this output, and what’s best may depend on the particular kind of project. At a minimum, it’s useful to have separate directories for each of the following:
RStudio has a feature to help keep everything organized in a self-contained, reproducible package, called a “project”.
A project is a small file with a .Rproj
extension, but you can think of all the files and sub-directories as belonging to that project. We recommend creating a directory and a project file for each project you work on. It should look something like this:
When you want to work on this project using R, double click on the .Rproj file, and RStudio will open it and keep everything organized for you. You can also open an existing project from RStudio by clicking “File -> Open project…”
Challenge – Create a Project
Let’s create a new project in RStudio that will contain all of our work for today.
- Click the “File” menu button, then “New Project”.
- Click “New Directory”.
- Click “Empty Project”.
- Type a descriptive directory name: This is the title of your project, so for this one you might use “DataCarpentry-day1” or something similar.
- Store the new directory in a sensible place in your computer’s organizational scheme. If you have a “workshops” or “classes” directory that would make sense. This will create a new directory called “DataCarpentry-day1” in that directory.
- Click “Create Project”.
If everything went right, RStudio should’ve flickered and you should be looking at a pretty bare RStudio instance. That’s okay. Click on the “Files” tab in the lower right pane. Your .Rproj file should be there with nothing else. You’ve got the bare bones of a new project. Let’s now create the directory structure described above, a folder for each of data, code, results, and papers. You can do this in RStudio by clicking on the “New Folder” button in the Files pane, or in your OS by navigating to the directory you just created.
Challenge – Project Organization
- In your project directory, either using the Project tab of RStudio or your OS’ file system, create the following directories:
- data
- code
- results
- papers
The finished product should look something like this on your computer:
This lesson is adapted from the Software Carpentry: R for Reproducible Scientific Analysis Project management with RStudio materials and the Data Carpentry: R for data analysis and visualization of Ecological Data Before We Start materials. .