R Function Usage Frequencies

2009-12-07 3-minute read

A few months ago I decided to apply word frequency analysis ideas to R code. My idea was simple: the functions that one should invest the most effort into learning are precisely those functions that are used most frequently in real world code. In fact, this simple idea can be applied to spoken languages as valuably as to programming languages: the words that will most improve your ability to understand day-to-day speech in a foreign language are precisely the words that occur most frequently in day-to-day speech. Our language classrooms tend to screw this up by emphasizing words like “spoon” over “dude”, even though people my age will use the word “dude” many more times a day than they’ll use “spoon”. (To this day, I still mix up the words for “spoon” (“cuchara”) and “knife” (“cuchillo”) in Spanish, thought I never make any mistakes with slang words like “dude” (“tio”).) In large part, this is because the usefulness of a word tends to be confounded with its respectability when you learn a language in an academic setting. This over-emphasis on respectability is why you’re never taught curse words, even though they’re as practically useful as the majority of other words.

But I digress. Returning to R, the attached CSV data set contains the results of my word frequency analysis. To produce this data set, I spidered all of the source code for every R package on CRAN, split the resulting corpus text into lexical tokens, and counted the occurrences of each token. To make things simpler, I decided to treat every token followed by a parenthesized expression as a function, even though this gives me some syntactical units like if in my list of functions.

I think the resulting raw data set is probably as useful as any summaries I can offer of it, but I’ll offer some obvious results for those interested.

Function call frequencies in R, unsurprisingly, seem to follow Zipf’s law or some variety thereof. You can see this fairly clearly in the following logarithmic plot:

R Log Frequencies

And here are the top 25 most frequently used functions in R, including some syntactical units:

Function Name	Occurrences in Corpus
if	333421
c	143123
function	137490
length	109847
paste	62906
cat	56199
return	56001
stop	54161
is.null	53575
list	51862
log	49479
for	48950
rep	39090
names	34439
sum	31475
as.integer	29417
matrix	29344
is.na	20230
dim	20202
max	19995
nrow	19789
as.double	19705
attr	18802
t	17889
print	17461

Some fun extensions to this work would be to use this frequency data set as a standard for the abnormality of code style: you could compare an individual programmer’s code to the standard frequency data to see which functions such-and-such a programmer tends to overuse and underuse. I personally tend to underuse stop and I never use attr at all.

Another interesting project would be to use this data set as input to a text classifier that would attempt to predict the author of a piece of R code based on token frequency information, in line with Mosteller and Wallace’s famous analysis of the Federalist Papers or anti-spam programs.