Science in Python: tools for faster, better biophysics

I came to this conference with the goal of showcasing new or underappreciated developments in software, in particular open source projects. From the conversations I’ve had with folks who use some of the tools I myself use or develop, it occurred to me that there are also many robust general-purpose software tools in existence today that are not widely used by people in our field. Somehow these tools haven’t penetrated the biophysics bubble, at least not to the degree they almost certainly could.

I’d like to introduce three tools that you may have heard of but perhaps haven’t used, or whose role in your day-to-day research isn’t yet clear to you. All three are Python packages, and building on Python means a few things (to quote Software Carpentry):

1. it’s free, well-documented, and runs almost everywhere;
2. it has a large (and growing) user base among scientists; and
3. experience shows that it’s easier for novices to pick up than most other languages.

If you’re new to Python, or just want to brush up, a good place to start is the Software Carpentry introductory Python lesson.

Using pandas for slicing and dicing datasets

If you’ve ever used a statistical analysis environment like R, you’ve probably used a dataframe. If not, you’ve almost certainly used a spreadsheet program, in which data is laid out (hopefully) in a well-structured way in terms of columns and rows. A dataframe is a set of columns giving observations aligned on a single index, which itself is a column of names, dates, or anything convenient to uniquely identify rows.

pandas provides a DataFrame object for Python. Once loaded with data from a common data format, it lets you easily obtain descriptive statistics, make plots, and dig into deeper relationships within your data through groupby operations on column values.
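To make that concrete, here is a minimal sketch of the workflow: build a small DataFrame, pull descriptive statistics, and group on a column. The column names and values (protein, temperature_K, rmsd_nm) are made up purely for illustration; in practice you would more likely load your data from a file.

```python
import pandas as pd

# Hypothetical measurements: every name and number here is invented
# for illustration, not taken from a real dataset.
df = pd.DataFrame(
    {
        "protein": ["A", "A", "B", "B"],
        "temperature_K": [300, 310, 300, 310],
        "rmsd_nm": [0.10, 0.14, 0.20, 0.26],
    },
    index=["run1", "run2", "run3", "run4"],  # the index uniquely labels rows
)

# Descriptive statistics (count, mean, std, quartiles, ...) for one column
summary = df["rmsd_nm"].describe()

# Group rows by the value in one column, then average another column
mean_by_protein = df.groupby("protein")["rmsd_nm"].mean()

print(summary)
print(mean_by_protein)
```

The same two calls work unchanged whether the frame has four rows or four million, which is a large part of the appeal.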

pandas is built on top of the robust numpy library, so operations on even millions of rows are quick and memory-efficient. For datasets that are larger than your machine’s memory, there are newer libraries such as dask that build on the same ideas that underlie pandas.

Science jazz with the Jupyter Notebook

A pain point when writing code to analyze data is that often the context of a scientific finding is removed from the work that produced it. Plots are in a folder over here, code is in a folder over there, and meanwhile the insights into what these things mean might live in a paper notebook on the other side of the desk.

It doesn’t have to be this way. The Jupyter Project provides a notebook-style environment for writing and executing code, producing plots, and writing expressive notes about what this all means in a single place.

The Jupyter Notebook makes it easy to spend the afternoon taking a dataset and exploring it from any (or every) angle you desire, all the while keeping a record of what you’ve done and what motivated it. The notebook can also be exported to raw HTML or a PDF, and can even be used for drafting your next manuscript. And because it all runs in a browser, working on notebooks running on remote machines works just as well.

Machine learning with scikit-learn

As data becomes an embarrassment of riches and insights harder to discern, it can sometimes take a clever dimensionality reduction or two to get going. scikit-learn is a Python library that gives robust and fast implementations of many common classification, regression, clustering, and dimensionality reduction algorithms with a common interface. This last bit is important, because it makes trying out different algorithms pretty painless, and once you’ve learned the basic idea of using one algorithm you can just as easily use any other.
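Here is a rough sketch of what that common interface looks like in practice. The data are synthetic (two well-separated blobs generated with numpy), and the choice of PCA, KMeans, and AgglomerativeClustering is just one illustrative combination; the point is that swapping one algorithm for another changes a single line.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.decomposition import PCA

# Synthetic data for illustration: two well-separated blobs in 5 dimensions
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 5))
blob_b = rng.normal(loc=5.0, scale=0.5, size=(50, 5))
X = np.vstack([blob_a, blob_b])

# Dimensionality reduction: project the 5 features onto 2 components
X_2d = PCA(n_components=2).fit_transform(X)

# Two different clustering algorithms, driven through the same method call
models = [
    KMeans(n_clusters=2, n_init=10, random_state=0),
    AgglomerativeClustering(n_clusters=2),
]
labels = {}
for model in models:
    labels[type(model).__name__] = model.fit_predict(X_2d)

for name, assigned in labels.items():
    print(name, np.bincount(assigned))
```

Both estimators take the same array and return one cluster label per row, so comparing algorithms is mostly a matter of adding entries to the list.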

The documentation is also extensive and well referenced, and it’s written in such a way that it often helps you choose which algorithms are most appropriate for the type of data you’ll be applying them to.

Plug in

The scientific Python ecosystem is mature and vibrant, with the scientific community standing as the beneficiary of its advances. With the explosion of “data science” as a career path in industry, the Python community has stepped up to the plate to build the tools that lead to better and faster insights. The biophysics community can gain from this too, but only if it is willing to try new tools with the potential to change the day-to-day grind of scientific research for the better.

