Visualize your GitHub stats (forks and watchers) in a browser with R!

Author  Scott Chamberlain

So OpenCPU is pretty awesome. You can run R in a browser using URL calls with an alphanumeric code (e.g., x3e50ee0780) defining a stored function, and any arguments you pass to it.

Go here to store a function. And you can output lots of different types of things: png, pdf, json, etc - see here.

Here's a function I created:

It makes a [ggplot2][] graphic of your watchers and forks on each repo (up to 100 repos), sorted by descending number of forks/watchers.
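A minimal sketch of this kind of plot, not the stored function itself. It assumes a data frame with hypothetical columns `name`, `forks`, and `watchers` instead of a live call to the GitHub API, and the helper names `prep_gitstats` and `plot_gitstats` are made up for illustration.

```r
# Hypothetical helper, not the stored OpenCPU function: reshape per-repo
# counts into long format, ordered by descending forks + watchers.
prep_gitstats <- function(repos, max_repos = 100) {
  stopifnot(all(c("name", "forks", "watchers") %in% names(repos)))
  repos$name <- as.character(repos$name)
  repos <- repos[order(-(repos$forks + repos$watchers)), ]
  repos <- head(repos, max_repos)
  long <- data.frame(
    name  = rep(repos$name, times = 2),
    type  = rep(c("forks", "watchers"), each = nrow(repos)),
    count = c(repos$forks, repos$watchers),
    stringsAsFactors = FALSE
  )
  # Fix the factor levels so the plot keeps the sorted order
  # (reversed, because coord_flip() draws the first level at the bottom)
  long$name <- factor(long$name, levels = rev(repos$name))
  long
}

# Hypothetical helper: dodged horizontal bars, one pair per repository.
plot_gitstats <- function(repos, max_repos = 100) {
  if (!requireNamespace("ggplot2", quietly = TRUE)) {
    stop("ggplot2 is required for plotting")
  }
  long <- prep_gitstats(repos, max_repos)
  ggplot2::ggplot(long, ggplot2::aes(x = name, y = count, fill = type)) +
    ggplot2::geom_col(position = "dodge") +
    ggplot2::coord_flip() +
    ggplot2::labs(x = "Repository", y = "Count", fill = NULL)
}

# Made-up example data
repos <- data.frame(
  name     = c("alpha", "beta", "gamma"),
  forks    = c(2, 10, 0),
  watchers = c(5, 30, 1)
)
long <- prep_gitstats(repos)
```

Calling `plot_gitstats(repos)` would then draw the figure, provided ggplot2 is installed.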

Here's an example from the function. Paste the following into your browser and you should get the figure below.

[http://beta.opencpu.org/R/call/opencpu.demo/gitstats/png][]

[Figure: had]

You can also specify a user or organization name using arguments in the URL:

[http://beta.opencpu.org/R/call/opencpu.demo/gitstats/png?type='org'&id='ropensci'][]

[Figure: ropensci]
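If you want to build such URLs from R rather than by hand, a small sketch could look like this. The helper name `build_gitstats_url` is hypothetical; it only assembles the string in the same form as the URL above (string arguments wrapped in single quotes) and does not contact the server.

```r
# Hypothetical helper: build an OpenCPU-style call URL with query arguments.
# The base URL is the one from the post.
build_gitstats_url <- function(type = NULL, id = NULL,
    base = "http://beta.opencpu.org/R/call/opencpu.demo/gitstats/png") {
  args <- list(type = type, id = id)
  # Drop arguments that were not supplied
  args <- args[!vapply(args, is.null, logical(1))]
  if (length(args) == 0) return(base)
  # Wrap each value in single quotes and join the pairs with "&"
  query <- paste0(names(args), "='", unlist(args), "'", collapse = "&")
  paste0(base, "?", query)
}

org_url <- build_gitstats_url(type = "org", id = "ropensci")
```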

Sweet. Have fun.

Posted in  datavisualization ggplot2 opencpu.org github


mvabund - new R pkg for multivariate abundance data

Author  Scott Chamberlain

There is a new R package in town, mvabund, which provides, as the authors say, "statistical methods for analysing multivariate abundance data". The authors introduced the package in an online early paper in Methods in Ecology and Evolution here, and the R package is here.

The package is meant to visualize data, fit predictive models, check model assumptions, and test hypotheses about community-environment associations.

Here is a quick example.

[Figure: mvabund1]

[Figure: mvabund2]
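A minimal sketch of such a workflow, assuming the `spider` dataset that ships with mvabund and that the package is installed. The wrapper name `run_spider_example` is hypothetical, and the choice of a negative binomial model is an assumption for illustration, not necessarily what produced the figures above.

```r
# Sketch only; requires the mvabund package to be installed.
run_spider_example <- function(n_boot = 99) {
  if (!requireNamespace("mvabund", quietly = TRUE)) {
    stop("install.packages('mvabund') first")
  }
  # Hunting spider abundances plus environmental variables
  data("spider", package = "mvabund", envir = environment())
  abund <- mvabund::mvabund(spider$abund)
  env <- as.matrix(spider$x)

  # Visualize: mean-variance relationship across species
  mvabund::meanvar.plot(abund)

  # Fit a multivariate GLM, one model per species
  fit <- mvabund::manyglm(abund ~ env, family = "negative.binomial")

  # Check assumptions: residuals vs. fitted values
  plot(fit)

  # Test the community-environment association by resampling
  anova(fit, nBoot = n_boot)
}
```

Running `run_spider_example()` walks through the four steps the package is meant for: visualize, fit, check, and test.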

Posted in  datavisualization abundance R


Journal Articles Need Interactive Graphics

Author  Pascal Mickelson

I should have thought of it earlier: In a day and age when we are increasingly reading scientific literature on computer screens, why is it that we limit our peer-reviewed data representation to static, unchanging graphs and plots? Why do we not try to create dynamic visualizations of our rich and varied data sets? Would we not derive benefits in the quality and clarity of scientific discourse from publishing these visualizations?

An article in the very good (and under-appreciated, in my opinion) American Scientist magazine written by Brian Hayes started me thinking about these questions. "Pixels or Perish" begins by recapping the evolution of graphics in scientific publications and notes that before people were good at making plots digitally, they were good at making figures using photographic techniques; and before that, from elaborate engravings. Clearly, the state-of-the-art in scientific publishing is a moving target.

Hayes points out that one of the primary advantages of static images is that everyone knows how to use them and that almost no one lacks the tools to view them. That is, printed images in a magazine or static digital images in the portable document format (pdf) are easily viewed on paper or on a screen and can be readily interpreted by a wide audience. While I agree that this feature is very important, why have we not, as scientists, moved to the next level? We do not lack the ability to interpret data--it is our job to do so--not to mention that we are some of the heaviest generators of data in the first place.

The obstacles to progress towards interactive data are two-fold. First, generating dynamic data visualizations is not as easy as generating static plots. The data visualization tools simply are not as well developed and they do not show up as frequently in the programming environments in which scientists work. One example Hayes cites is that the ideas from programs such as D3 have not yet made an appearance in software, like R and Matlab, that more scientists use. This is one reason why I am so excited by the work that our very own Scott has been doing with this Recology blog, in trying to promote awareness of tools in R.

The second is that neither of our currently dominant publishing formats (physical paper and digital pdf files) supports dynamic graphics. Hayes says it better than I could: "…the Web is not where scientists publish…[publications are]…available through the Web, not on the Web." So, not many current publications really take advantage of the new capabilities that the Web has offered us to showcase dynamic data sets. In fact, while Science and Nature--just to name two prominent examples of scientific journals--make available HTML versions of their articles, it seems like most of the interactivity is limited to looking at larger versions of figures in the articles*. I myself usually just download the pdf version of articles rather than viewing the HTML version. This obstacle, however, is not a fundamental one; it is only the current situation.

The more serious obstacle that Hayes foresees in transitioning to dynamic graphics is one of archiving. Figures in journal articles printed in 1900 are still readable today, but there is no guarantee that a particular file format will survive in usable form to 2100, or even 2020. I do not know the answer to this conundrum. A balance might need to be struck between generating static and dynamic data. At least in the medium term, papers should probably also contain static versions of figures representing dynamic data sets. It is inelegant, but it could avoid the situation where we lose access to information that was once there.

That said, if the New York Times can do it, so can we. We should not wait to make our data presentation more dynamic and interactive. At first, it will be difficult to incorporate these kinds of figures into the articles themselves, and they will likely be relegated to the "supplemental material" dead zone that is infrequently viewed. But the more dynamic material that journals receive from authors, the more incentive they will have to expand upon their current offerings. Ultimately, doing so will greatly improve the quality of scientific discourse.

* Whether the lack of dynamic data visualization on these journals' websites is due to the authors not submitting such material or due to restrictions from the journals themselves, I do not know. I suspect the burden falls more on the authors' shoulders at this point than the journals'.

Posted in  datavisualization publishing interactivegraphics


Take the INNGE survey on math and ecology

Author  Scott Chamberlain

Many ecologists are R users, but we vary in our understanding of the math and statistical theory behind models we use. There is no clear consensus on what should be the basic mathematical training of ecologists.

To learn what the community thinks, we invite you to fill out a short and anonymous questionnaire on this topic here.

The questionnaire was designed by Frédéric Barraquand, a graduate student at Université Pierre et Marie Curie, in collaboration with the International Network of Next-Generation Ecologists (INNGE).

Posted in  R math statistics ecology


Scraping Flora of North America

Author  Scott Chamberlain

So Flora of North America is an awesome collection of taxonomic information for plants across the continent. However, the information within is not easily machine readable.

So, a little web scraping is called for.

rfna is an R package to collect information from the Flora of North America.

So far, you can:

1. Get taxonomic names from web pages that index the names.
2. Then get daughter URLs for those taxa, which then have their own 2nd order daughter URLs you can scrape, or scrape the 1st order daughter page.
3. Query Asteraceae taxa for whether they have paleate or epaleate receptacles. This function is something I needed, but more functions will be made like this to get specific traits.
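The first two steps amount to pulling names and link targets out of an index page. As a rough illustration of that idea, and not rfna's actual implementation, here is a base R sketch. The helper `extract_links` and the HTML fragment are made up; for real pages a proper HTML parser would be more robust than regular expressions.

```r
# Hypothetical helper, not part of rfna: pull link targets and link text
# out of an HTML string with base R regular expressions.
extract_links <- function(html) {
  pattern <- "<a[^>]*href=\"([^\"]*)\"[^>]*>([^<]*)</a>"
  hits <- regmatches(html, gregexpr(pattern, html, ignore.case = TRUE))[[1]]
  data.frame(
    url  = sub(pattern, "\\1", hits, ignore.case = TRUE),
    name = trimws(sub(pattern, "\\2", hits, ignore.case = TRUE)),
    stringsAsFactors = FALSE
  )
}

# Made-up index page fragment
page <- paste0(
  '<ul><li><a href="taxon_a.html">Taxon A</a></li>',
  '<li><a href="taxon_b.html"> Taxon B </a></li></ul>'
)
links <- extract_links(page)
```

Each row of `links` pairs a taxon name with the daughter URL to visit next.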

Further functions will do search, etc.

You can install by:

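# devtools provides install_github() for installing packages from GitHub
install.packages("devtools")
library(devtools)
# current versions of devtools expect the repository as "owner/repo"
install_github("rOpenSci/rfna")
library(rfna)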

Here is an example where a set of URLs is acquired using the function getdaughterURLs, then the function receptacle is used to ask whether each of the taxa at those URLs has paleate or epaleate receptacles.

Posted in  R scraping

