### A thought experiment: How CRAN saved 3,620 (working) lives

Given the vast number of R packages available today, it makes sense (at least to me, as a trained economist) to ask a simple yet difficult question: How much value has been created by all those packages?

As all R stuff on CRAN is open-source (which is a blessing), there is no measurable GDP contribution in terms of market value that we can use to provide a quick answer. But all of us R users know the pleasant feeling, not to say the excitement, of finding a package that provides exactly the functionality we have been looking for for so long. This saves us the time of developing the functionality ourselves. So, apparently, the time saved is one way to estimate the beneficial effect of package sharing on CRAN.

Here comes a simple (and not too serious) approach to estimating this effect.

(Side note: I am well aware of the extremely high concentration of capable statisticians and data scientists in the R community, so please be lenient with my approach. As you will see shortly, I am not aiming to deliver a scientific paper on the matter, although it might be worthwhile to do so. If there are already papers on the topic out there, I am sure they have figured out much better approaches; in that case, please simply leave a comment below.)

Without further ado, let's get right into it: Since the recordings began, the RStudio CRAN server has seen

**1,121,724,508** package downloads as of today (afternoon [CET] of July 14th, 2018). This number was generated by running through all the **12,781** R packages identified with the CRAN_package_db() function from the tools package and adding up their download figures, which I retrieved from the CRAN server logs via RStudio CRAN's HTTP interface. This interface returns a JSON result that can easily be read using the fromJSON() function from the jsonlite package. To be a bit more precise: the whole operation was done with the buildIndex() function from my package packagefinder, as it integrates all this functionality.

Let's assume 30% of these downloads are 'false positives', i.e. cases in which the user realized the package is not really suitable for his/her purposes. (Of course, in a more sophisticated approach we would need to account for package dependencies as well; we neglect them here for the sake of simplicity.) Removing the 'false positives' leaves us with

**785,207,156** downloads.

Next, we assume that everyone who has downloaded a package would have developed the package's functionality on his/her own if the package had not been available on CRAN. And let us further assume that this development effort would have taken one hour of work on average for each package. (You can play with the parameters of the model; one hour seems really, really low, at least to me, but let's keep it conservative for now.)
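As an aside on where the download total comes from: the sketch below shows how a single package's figure could be read from a JSON interface with fromJSON(). It assumes that the HTTP interface mentioned above is the public cranlogs API at cranlogs.r-pkg.org and that its "total" endpoint takes a date range and a package name. The helper names `cranlogs_total_url()` and `parse_total()` are made up for this illustration, the start date is an assumption, and the sample response contains a placeholder count, not a real figure. It is not the code inside buildIndex().

```r
library(jsonlite)

# Build the request URL for one package's total downloads in a date range.
# ASSUMPTION: endpoint layout of the public cranlogs API.
cranlogs_total_url <- function(pkg, from, to) {
  sprintf("https://cranlogs.r-pkg.org/downloads/total/%s:%s/%s", from, to, pkg)
}

# Extract the download count from the JSON text returned by the interface.
parse_total <- function(json_text) {
  res <- fromJSON(json_text)  # a JSON array of objects becomes a data frame
  sum(res$downloads)
}

# Live request (needs network access, therefore commented out).
# The start date is an assumption about when the logs begin.
# url <- cranlogs_total_url("jsonlite", "2012-10-01", "2018-07-14")
# parse_total(paste(readLines(url, warn = FALSE), collapse = ""))

# Offline illustration with a made-up response (placeholder count)
sample_json <- '[{"start":"2012-10-01","end":"2018-07-14","downloads":12345,"package":"jsonlite"}]'
parse_total(sample_json)
```

Summing such totals over all package names would give an overall figure like the one quoted above, at the cost of one request per package.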

But R users are not only extremely capable, almost ingenious programmers, they also have an incredible work ethic: Of course, everyone who works with R is an Elon Musk-style worker, which means he or she “puts in 80 to 90 hour work weeks, every week” (Musk in his own words). So, let’s be conservative and assume an agreeable 80-hour work week (there should be at least some work-life balance, after all; I mean, some people even have a family!).

Calculating our model with these parameters leads to the almost incredible amount of

**188,235** work *years* saved (if you assume a year of 365/7 = 52.14 weeks; of course, our hard-working R user does not have any time for vacation or any other time off).

If you assume a working life spans the ages of 18 to 70, this means that the time saved by sharing packages on CRAN is equivalent to

**3,620** working *lives*. A truly incredible number.

For all those who want to do the math themselves, here is the R code I used:

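```r
library(packagefinder)

# Attention: The next statement takes quite some time to run
# (> 1 hour on my machine)
buildIndex("searchindex.rdata", TRUE)
load("searchindex.rdata")

# Total downloads across all indexed packages
package.count <- sum(searchindex$index$DOWNL_TOTAL)
# Remove the assumed 30% 'false positives'
package.count.corrected <- package.count * 0.7
# One hour of development effort saved per download
package.hours <- package.count.corrected * 1
# 80-hour work weeks, 365/7 weeks per year, working life from 18 to 70
package.workweeks <- package.hours / 80
package.workyears <- package.workweeks / (365 / 7)
package.worklifes <- package.workyears / (70 - 18)
```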
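For anyone who wants to play with the parameters without rebuilding the index, here is a minimal sketch that wraps the same arithmetic in a function. The function name `worklives_saved()` and its argument names are made up for this illustration; the defaults are the assumptions stated in the post, and the download total is the figure quoted above.

```r
# Parameterised version of the calculation above (hypothetical helper,
# not part of packagefinder). Defaults mirror the post's assumptions.
worklives_saved <- function(downloads,
                            false_positive_share = 0.3,
                            hours_per_package    = 1,
                            hours_per_week       = 80,
                            weeks_per_year       = 365 / 7,
                            years_per_worklife   = 70 - 18) {
  useful_downloads <- downloads * (1 - false_positive_share)
  hours     <- useful_downloads * hours_per_package
  workyears <- hours / hours_per_week / weeks_per_year
  workyears / years_per_worklife
}

downloads <- 1121724508  # total quoted in the post (July 14th, 2018)

# Baseline with the post's parameters
round(worklives_saved(downloads))  # 3620

# A less heroic 40-hour week doubles the number of working lives
round(worklives_saved(downloads, hours_per_week = 40))

# Higher shares of downloads that save nobody any time
round(worklives_saved(downloads, false_positive_share = c(0.3, 0.5, 0.9)))
```

Because every step is a plain multiplication or division, the result scales linearly with each parameter, so any single assumption that is off by a factor of ten shifts the final number by the same factor.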

I think maybe you need to account for automated downloads, such as those made by continuous integration servers building containers that have the packages as dependencies. For example, a GitHub search for jsonlite alone yields ~30K repositories (https://github.com/search?l=R&q=jsonlite&type=Code), and just one of those repos with a CI setup could be downloading thousands of times (on what appear to be unique servers, given the CI service). There is likely a yuuuuge margin of error here, and some confidence would be needed to associate a download with an actual, single user.

The number of CRAN downloads does indeed include automatic downloads (server updates), which can be a substantial source of error. Nevertheless, it is interesting to see that by the power of the community we can really make a big step forward.

Definitely agree! I'm an open source developer, and nothing makes me happier than to see the power of the community. I do, however, think it's important not to make misleading inferences based on assumptions or a naive spirit. For example, there is a container on the Singularity container registry (singularity-images) that is used (pulled or downloaded) for testing. It also appears in a few scattered demos, so one might conclude that the number reflects real usage to some extent. How many downloads does it have? Almost 40k! See here --> https://ibb.co/iW7yDd Amazing? Possibly, but actually not, because it's pulled for testing here --> https://github.com/singularityware/singularity/blob/f2d7795d04bad5af371c3a4a0a6f372ba07ff810/libexec/python/tests/test_shub_api.py#L48 which has been done almost 2k times, each time across a grid of 6 or 7 different configurations. This is only one repo that uses the image, and it would be even more challenging to find the other places that pull it for testing or for building other software. So it's good to at least acknowledge this huge source of missing information, because not doing so is misleading for a reader who doesn't know about the possibility.
