Big Data: Big Opportunity or Big Problem?

By Patrick J. Burns, VP-IT, Colorado State University

If you have not seen the movie “Moneyball,” starring Brad Pitt as the general manager of the Oakland A’s, I highly recommend it. The A’s were the first baseball team to hire a statistician to analyze what makes a baseball player effective, and they hired players accordingly. As a result, the A’s reached the playoffs in 2002 with one of the lowest payrolls in baseball. This degree of success fundamentally changed the game of baseball to rely heavily on statistics, and all sports have followed this trend, some to the extreme. Big Data now seems to be the newest, shiniest direction in higher education, especially for its promise to improve student success. Many institutions define student success purely statistically, by factors of student persistence, retention, and completion.

However, we at CSU define success first as student performance, i.e., learning. We feel strongly that improved learning will improve the traditional factors of persistence, retention, and completion. In addition to the moral imperative that obligates us to ensure that the students we enroll succeed, there is the benefit of increased tuition revenue when they do. Every percentage point by which we raise our student graduation rate increases our tuition revenue by almost $1 million. Indeed, our target of improving our student graduation rate by 10 percent ought to yield additional revenue to deploy large-scale educational analytics and additional student services to support and enhance student success. Institutions that do not engage in such “educational moneyball” will be left behind as other institutions rely on ever more detailed and comprehensive statistical analyses of student success.

Of course, higher education has been dealing with “Big Data” for over half a century. Big Data activities are prevalent in the sciences and technical disciplines, where extremely large-scale experiments and high-performance computing simulations yield massive amounts of data. While institutions continue to struggle a bit with the number and size of research files, and with the organization, storage, backup, and transport of those files, we are progressing acceptably in this arena.

This article explores the new arena of Big Data learning analytics. Every time we turn around, we are confronted with a new, vended system that claims to have the “secret sauce” for learning analytics: “Just buy my product, and we will solve all of your problems.” Vendors seem to be adding analytics to their systems haphazardly, with no solid evidence that their approach is correct or valid. Indeed, a dizzying array of learning analytics options is emerging, each using somewhat overlapping and somewhat disjoint data. What to do, what to do?

Our observation is that learning analytics as a discipline is in its earliest, nascent phase. Formally, learning analytics is a problem of very high dimensionality with numerous yet little-understood independent variables, and unfortunately the number of dimensions changes as different analytical questions arise. In short, it is not yet a well-understood realm.

Indeed, thinking about how to progress with Big Data learning analytics requires a process flow. First we ask: In which areas does our institution initially wish to engage in learning analytics? This scope determines the research questions that need to be defined next. The research questions in turn determine the data elements needed. Then, central IT must determine how to house, preserve, and make that data accessible, including hardware, software, processes, and systems (and here is where it starts getting potentially very expensive). Then, the institution must determine the organizational unit(s) responsible for conducting the learning analytics studies.

Finally, and most importantly, the institution needs to support implementation of additional student success services to actualize the results of the analyses. This process flow is akin to “This is the house that Jack built,” and motivates institutional change. At Colorado State, we have experimented with “medium data” learning analytics for just our Calculus I class, and are seeing significant success via sophisticated regressions we are performing on our Cray computer using the R statistical package. The data for that analysis must be extracted from multiple systems, transformed to the required format, and loaded into the R statistical package (ETL: extract, transform, and load). This process is very staff-intensive, and deucedly difficult.
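To make the extract, transform, and load steps concrete, here is a minimal sketch in Python. It is not our production process: the two source systems, every column name, and the sample records are hypothetical, invented only to show the shape of the work. It reads two CSV exports, reconciles their student identifiers, joins them, and writes one flat file that a statistical package such as R could read.

```python
import csv
import io

# Hypothetical exports from two campus systems. All column names and
# records are invented for illustration; real extracts will differ.
REGISTRAR_CSV = """student_id,term,final_grade
S001,2013FA,B
S002,2013FA,D
S003,2013FA,W
"""

LMS_CSV = """StudentID,HomeworkSubmitted,HomeworkAssigned
s001,11,12
s002,5,12
s003,2,12
"""

OUTPUT_FIELDS = ["student_id", "term", "final_grade", "homework_rate"]


def extract(text):
    """Extract: parse one system's CSV export into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(registrar_rows, lms_rows):
    """Transform: normalize student IDs and join the two sources."""
    homework = {}
    for row in lms_rows:
        key = row["StudentID"].strip().upper()
        homework[key] = int(row["HomeworkSubmitted"]) / int(row["HomeworkAssigned"])

    merged = []
    for row in registrar_rows:
        key = row["student_id"].strip().upper()
        rate = homework.get(key)  # None when the LMS has no record
        merged.append({
            "student_id": key,
            "term": row["term"],
            "final_grade": row["final_grade"],
            "homework_rate": None if rate is None else round(rate, 3),
        })
    return merged


def load(rows):
    """Load: write the joined rows as CSV text for the analysis tool."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=OUTPUT_FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


analysis_rows = transform(extract(REGISTRAR_CSV), extract(LMS_CSV))
analysis_csv = load(analysis_rows)
print(analysis_csv)
```

Even this toy version hints at the hard part: the two systems spell the same student identifier differently, and a student missing from one source has to be handled deliberately rather than silently dropped.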

It is so difficult, in fact, that we are in the process of completely re-architecting our data environment to support this type of large-scale learning analytics. Many of my colleagues indicate that similar Big Data-driven efforts are underway at their institutions. Early results for our Calculus I course indicate excellent success: a better than 90 percent match in predicting what we define as our “at risk” population, that is, students who receive a grade of C, D, F, or W. Interestingly, using a student population of the last five years yields a better fit than using the last ten years (student demographics have changed too much in the last decade), and we have seen the gender gap (females not performing as well as males) in the ten-year population disappear in the five-year population. Glorious! But every good research study inevitably yields many more, deeper questions. For example, now we are asking, “What makes students who take Calculus I a second time succeed?” I could go on and on.
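As a rough illustration of how an “at risk” match can be scored, the sketch below labels students using the grade definition given above and then compares those labels with a model's predictions. It rests on assumptions: “match” is taken here to mean the share of students whose predicted label agrees with the observed one, the grades and predictions are made up, and no regression model is included, since the details of ours are not described in this article.

```python
# Grades that define the "at risk" population, as stated in the article.
AT_RISK_GRADES = {"C", "D", "F", "W"}


def is_at_risk(grade):
    """Return True when a final grade falls in the at-risk set.

    Plus/minus grades (for example "C+") are not handled in this sketch.
    """
    return grade.strip().upper() in AT_RISK_GRADES


def match_rate(predicted, actual):
    """Share of students whose predicted flag equals the observed flag.

    This reading of "match" is an assumption, not the article's definition.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must be the same length")
    if not actual:
        raise ValueError("at least one student is required")
    agreements = sum(p == a for p, a in zip(predicted, actual))
    return agreements / len(actual)


# Made-up final grades and made-up model output, for illustration only.
observed_grades = ["A", "C", "B", "W", "F", "B", "D", "A", "C", "B"]
predicted_flags = [False, False, False, True, True, False, True, False, False, False]

actual_flags = [is_at_risk(g) for g in observed_grades]
rate = match_rate(predicted_flags, actual_flags)
print(f"match rate: {rate:.0%}")
```

A single agreement figure like this can hide which kind of error dominates. In practice it is worth counting separately the at-risk students the model missed and the students it flagged unnecessarily, since the two have very different costs for student services.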

So, Big Data does indeed present a tremendous opportunity in the area of learning analytics and student success, but it remains a daunting problem due to its complexity. Many of our institutions, our data, our systems, and our processes are not yet organized to support Big Data learning analytics, much less the implementation of additional services suggested by the analyses. Yet the potential payoff is so significant that we will self-organize to solve it, and we will see our environments change and adapt just as they have in professional sports. The game is afoot!