The midterm is composed of two parts: an individual and a group component.
The individual component describes the data set and gives examples of both the information that can be gleaned from it and the visualizations that can be made. Be as concise as possible: fight the verbosity of academia!
The README must contain:
A brief overview of what the data set contains. This should be a short paragraph composed of short sentences describing the source of the data and the type of information it holds. This will look a lot like the opening paragraph of the Data Dictionary that you have, though possibly even shorter.
Example Insights. Provide about 5-10 pieces of information that are notable and illustrate the nature of the data set. These should be both engaging and potentially useful to someone looking to analyze the data. Taken as a group, they should illustrate multiple ways that the data set might be used (i.e., not all referencing the same variable). For example: “The 311 system categorizes requests for service into 218 types, the most common being General Request.” “The neighborhood with the most 911 calls in 2013 was Dorchester.”
Visualization. Include at least two visualizations that illustrate an aspect of the data set that you find particularly noteworthy and that complement the example insights (the “fun facts”).
Keep it as short as possible. Two pages is usually enough. Remember that a visitor will want to read this quickly to decide whether the data set is of interest. Points will be deducted if it is too verbose.
For a strong example from a previous year, see the attachment above.
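An insight of the kind quoted above (number of categories, most common category) can be derived in a few lines of base R. The following is only a minimal sketch on toy data: the data frame requests, the column name TYPE, and its values are hypothetical stand-ins, not the actual 311 file.

```r
# Toy stand-in for a record-level file. The column name TYPE and its values
# are hypothetical and are not taken from the actual 311 data.
requests <- data.frame(
  TYPE = c("General Request", "Pothole", "General Request",
           "Street Light", "Pothole", "General Request"),
  stringsAsFactors = FALSE
)

# Count records per category and sort from most to least common.
type_counts <- sort(table(requests$TYPE), decreasing = TRUE)

n_types     <- length(type_counts)    # number of distinct categories
most_common <- names(type_counts)[1]  # label of the most frequent category

cat(sprintf("Requests fall into %d types, the most common being %s.\n",
            n_types, most_common))
```

The same sorted counts could also feed a bar chart, so one computation can support both an insight and a visualization.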
For the group component, make sure your group submits:
The record-level file with all new and modified variables included.
Updated Data Documentation that describes the new variables and, if necessary, revises the descriptions of any existing variables.
An R Markdown document that clearly lays out every data cleaning and variable creation step. The code should be efficient and complete, so that running it all at once on the raw data recreates the updated data set.
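One way to meet the "run all at once" requirement is to put every cleaning and variable creation step in a single function that takes the raw data and returns the updated data set. The code below is a sketch under stated assumptions, not a required structure: the function clean_requests and the columns OPEN_DT, CLOSED_DT, and DaysToClose are hypothetical names chosen for illustration.

```r
# All names below (clean_requests, OPEN_DT, CLOSED_DT, DaysToClose) are
# hypothetical placeholders for illustration.
clean_requests <- function(raw) {
  out <- raw
  # Modified variables: parse date strings into Date objects.
  out$OPEN_DT   <- as.Date(out$OPEN_DT,   format = "%Y-%m-%d")
  out$CLOSED_DT <- as.Date(out$CLOSED_DT, format = "%Y-%m-%d")
  # New variable: days between opening and closing (NA if never closed).
  out$DaysToClose <- as.numeric(
    difftime(out$CLOSED_DT, out$OPEN_DT, units = "days")
  )
  out
}

# In a real document the raw data would be read from file (for example with
# read.csv()); a small inline data frame stands in for it here.
raw <- data.frame(
  OPEN_DT   = c("2013-01-02", "2013-01-05"),
  CLOSED_DT = c("2013-01-04", NA),
  stringsAsFactors = FALSE
)

cleaned <- clean_requests(raw)
print(cleaned)
```

Because the function never modifies raw in place, rerunning the whole document from the top always starts from the same raw data and produces the same updated data set.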