Scaling Up Analysis in a SEISMIC Measurement Project
By Eben B. Witherspoon
The AP Project team in the Measurement Working Group has been moving quickly through its project milestones. After starting strong last fall and developing shared analyses for the initial three SEISMIC institutions involved, the team has started “scaling up” the project to the other seven SEISMIC institutions. While the three initial institutions are ironing out the details of their final models, the scale-up institutions are in various stages of cleaning their data and getting it ready to process. In this scale-up process, our AP Project team has learned much about what it takes to run a project across SEISMIC, and what key challenges can come up. We hope this post will provide some support for our fellow Measurement projects as they start preparing to scale up their efforts!
1. Set Common Variable Names
The first thing to consider when scaling up a Measurement project is setting common definitions and variable names for key concepts in the project. It was surprising to us how much variation there was by institution in seemingly straightforward terms such as “cohort.” Taking the time to clarify these not only facilitates discussions during meetings, but is important for ensuring that variables are being defined and generated in the same way, and that eventually, shared analysis code can be run on each institution’s dataset.
Variable definitions and names from initial discussions in the AP and Demographic projects were merged in a working document to categorize variables across the Working Group. This could be a good jumping off point for coming up with SEISMIC-wide institutional data variable names.
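To make the idea of common variable names concrete, here is a minimal Python sketch of how each institution might map its local column names onto a shared set before running shared code. The shared names, institution labels, and local column names below are all hypothetical placeholders, not the actual SEISMIC definitions.

```python
# Hypothetical shared variable names (not the real SEISMIC list).
SHARED_NAMES = {"student_id", "cohort", "discipline", "ap_score"}

# Hypothetical mappings from each institution's local names to the shared names.
INSTITUTION_MAPS = {
    "inst_a": {"ID": "student_id", "entry_term": "cohort",
               "subj": "discipline", "AP": "ap_score"},
    "inst_b": {"stu_id": "student_id", "cohort_year": "cohort",
               "dept": "discipline", "ap_exam_score": "ap_score"},
}


def harmonize(rows, institution):
    """Rename the keys of each row (a dict) to the shared variable names."""
    mapping = INSTITUTION_MAPS[institution]

    # Fail early if the mapping does not cover every shared variable.
    missing = SHARED_NAMES - set(mapping.values())
    if missing:
        raise ValueError(f"{institution} mapping lacks: {sorted(missing)}")

    harmonized = []
    for row in rows:
        # Fail early on local columns that nobody has defined yet.
        unknown = set(row) - set(mapping)
        if unknown:
            raise ValueError(f"{institution} has unmapped columns: {sorted(unknown)}")
        harmonized.append({mapping[name]: value for name, value in row.items()})
    return harmonized


local_rows = [{"ID": 1, "entry_term": 2018, "subj": "BIO", "AP": 4}]
shared_rows = harmonize(local_rows, "inst_a")
```

Raising an error on unmapped or missing names, rather than silently skipping them, is one way to surface the definitional differences described above before they reach the analysis.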
2. Attend to Dataset Formatting
Second, it is essential to decide up front on the format of the dataset for analysis (e.g., long vs. wide). Unless this is explicitly discussed, it’s easy to assume everyone is doing it the same way, but there are many ways people store and think about time-series data, and the choice has big implications for doing shared analyses later. For example, in the AP Project, we eventually landed on a combination similar to “panel data”: our data is wide by student (i.e., each row is a single student) and stacked long by discipline (i.e., all students who took BIO are “stacked” on top of all students who took PHYSICS). Every student is unique within a discipline but can appear in more than one discipline (e.g., a student who took both BIO and PHYSICS has one row in each). This made the most sense for our project, as it allowed us to easily subset our analyses by discipline. It might be overkill, but creating a mock-up dataset can help to show visually how the data looks, which variables are time-invariant, and so on.
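A mock-up of the layout described above might look like the following Python sketch. The column names and values are invented for illustration only. It also includes a small check that each student appears at most once within a discipline, which is the property the stacked format relies on.

```python
from collections import Counter

# Invented mock-up rows: one row per student per discipline.
# Student "s1" took both BIO and PHYSICS, so they appear once in each stack.
MOCK_ROWS = [
    {"student_id": "s1", "discipline": "BIO", "ap_score": 4, "course_grade": 3.3},
    {"student_id": "s2", "discipline": "BIO", "ap_score": 3, "course_grade": 2.7},
    {"student_id": "s1", "discipline": "PHYSICS", "ap_score": 5, "course_grade": 3.7},
    {"student_id": "s3", "discipline": "PHYSICS", "ap_score": 4, "course_grade": 3.0},
]


def duplicate_keys(rows):
    """Return (student_id, discipline) pairs that occur more than once."""
    counts = Counter((r["student_id"], r["discipline"]) for r in rows)
    return sorted(key for key, n in counts.items() if n > 1)


def subset_by_discipline(rows, discipline):
    """Return only the rows belonging to one discipline's stack."""
    return [r for r in rows if r["discipline"] == discipline]


bio_rows = subset_by_discipline(MOCK_ROWS, "BIO")
problems = duplicate_keys(MOCK_ROWS)  # an empty list means the format holds
```

Running a check like `duplicate_keys` on each institution's dataset before any modeling is one inexpensive way to confirm that everyone has stacked their data the same way.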
3. Share Model Specifications and Basic Descriptive Stats
Last but certainly not least, once you have settled on your Research Questions (RQs), it is very helpful to share clear model specifications, including the dependent variable (DV), independent variables (IVs), and analytic sample. Even when the same model is being run, institutions may have different understandings of which sample is being analyzed, and the resulting misinterpretation of patterns across institutions is easy to miss when looking only at regression tables. Sometimes something as simple as looking at sample sizes can help catch these discrepancies early on. For example, if two schools of about the same size have vastly different Ns, something may be up. One way our project addressed this issue was by moving the part of the analysis that defines the sample (which was previously done in each individual institution’s data cleaning) to the shared script, so that each institution was literally running the same code to subset the data and generate the sample for each analysis. Of course, in order to do this, there first needed to be common variable names and similarly formatted datasets…hence parts 1 and 2 🙂 As an added “bonus,” these checks and balances worked together: if our shared code couldn’t run or gave us weird results, this led us to uncover previously undiscovered issues in our variable or sample definitions!
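The idea of defining the sample once, in shared code, and then comparing Ns across institutions can be sketched in Python as follows. The inclusion rule, variable names, and institution labels are assumptions made up for this example and are not the AP Project's actual sample definition.

```python
def analytic_sample(rows, discipline):
    """Shared sample definition that every institution runs unchanged.

    Hypothetical rule: keep students in the given discipline who have
    both an AP score and a course grade on record.
    """
    return [
        r for r in rows
        if r["discipline"] == discipline
        and r["ap_score"] is not None
        and r["course_grade"] is not None
    ]


def sample_report(datasets, discipline):
    """Report total rows and analytic-sample N for each institution."""
    report = {}
    for institution, rows in datasets.items():
        report[institution] = {
            "n_total": len(rows),
            "n_sample": len(analytic_sample(rows, discipline)),
        }
    return report


# Invented data for two hypothetical institutions.
datasets = {
    "inst_a": [
        {"discipline": "BIO", "ap_score": 4, "course_grade": 3.3},
        {"discipline": "BIO", "ap_score": None, "course_grade": 2.7},
        {"discipline": "PHYSICS", "ap_score": 5, "course_grade": 3.7},
    ],
    "inst_b": [
        {"discipline": "BIO", "ap_score": 3, "course_grade": 3.0},
        {"discipline": "BIO", "ap_score": 5, "course_grade": None},
    ],
}

report = sample_report(datasets, "BIO")
```

Sharing a small table like `report` alongside the regression output gives everyone a quick way to spot an institution whose N looks out of line with its size.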
We recommend each project save its analysis code in the SEISMIC-wide GitHub repository (email your GitHub username to firstname.lastname@example.org to join). This is a great way to share code and track changes without making overlapping edits. Our project also used R/RStudio and Google Colab with Jupyter notebooks to share, run, and comment on each other’s code as we were developing it. Then, we saved the agreed-upon code in our AP Project (WG1-P4) GitHub repository.
We have also found it helpful to use R Markdown to create an “Analysis Workflow” file, which acts as a guide for AP Project participants in understanding the analysis process overall, including how to create a dataset that will work with our shared analysis code. It captures much of our thinking on streamlining and simplifying the scale-up process. It also serves as a single location that links to various other relevant documents for running analyses (e.g., variable naming conventions, model specifications). In addition, the document itself is shared and editable, which allows notes to be added by institutions as more specific things pop up that might be useful to others (e.g., Pitt didn’t have a variable for the year AP was taken, so we developed and noted our work-around).
Calling All Demographics Project Analysts
We would love feedback from Demographics Project (WG 1 P1) analysts about our process and what you have been doing to coordinate. For example, have you found less complicated workarounds for the same issues? Are we missing key parts of scaling up that you’ve experienced? Let us know!
Interested in joining the AP Project?
Overall, the process for onboarding is:
- Join the Project GitHub repository (by emailing a GitHub username to email@example.com)
- Read the “Analysis Workflow” file (available on GitHub)
- Preview the SEISMIC variable definitions doc (also linked to in the Analysis Workflow file)
- Create an institution-specific folder in the WG1-P4 GitHub repository that contains the data cleaning files for that institution’s data. These will all be slightly different, but it may be useful to see how others have done it, as there will be some overlap.
- Once the data is in the same format, run the shared analysis file (available on GitHub).
- Join one of our meetings and share your findings!
Eben B. Witherspoon, Ph.D.
Eben Witherspoon is a post-doctoral researcher in the Learning Research and Development Center (LRDC) at the University of Pittsburgh. His main line of research examines attitudinal and environmental factors during the transition to college that influence retention in STEM career pathways for underrepresented students. Currently, he is working on a project looking at the factors influencing gendered attrition in the pre-med course sequence. Eben is an active SEISMIC member and works on the AP Project (Measurement Working Group, Project 4).