DataFrames and dplyr

dplyr

Note

This section is available as a jupyter notebook dplyr.ipynb (HTML render: dplyr.html)

from functools import partial
from rpy2.ipython import html
html.html_rdataframe=partial(html.html_rdataframe, table_class="docutils")

dplyr in Python

We need 2 things for this:

1- A data frame (using one of R’s demo datasets).

from rpy2.robjects.packages import importr, data
datasets = importr('datasets')
mtcars_env = data(datasets).fetch('mtcars')
mtcars = mtcars_env['mtcars']

In addition to that, and because this tutorial is in a notebook, we initialize HTML rendering for R objects (pretty display of R data frames).

import rpy2.ipython.html
rpy2.ipython.html.init_printing()

2- dplyr

from rpy2.robjects.lib.dplyr import DataFrame

With this we have the choice of chaining (D3-style)

dataf = (DataFrame(mtcars).
         filter('gear>3').
         mutate(powertoweight='hp*36/wt').
         group_by('gear').
         summarize(mean_ptw='mean(powertoweight)'))

dataf
DataFrame with 2 rows and 2 columns:
gear mean_ptw
0 1 4.0 1237.1266499803169
1 2 5.0 2574.0331639315027

or piping (magrittr style).

from rpy2.robjects.lib.dplyr import (filter,
                                     mutate,
                                     group_by,
                                     summarize)

dataf = (DataFrame(mtcars) >>
         filter('gear>3') >>
         mutate(powertoweight='hp*36/wt') >>
         group_by('gear') >>
         summarize(mean_ptw='mean(powertoweight)'))

dataf
DataFrame with 2 rows and 2 columns:
gear mean_ptw
0 1 4.0 1237.1266499803169
1 2 5.0 2574.0331639315027

The strings passed to the dplyr function are evaluated as expression, just like this is happening when using dplyr in R. This means that when writing mean(powertoweight) the R function mean() is used.

Using a Python function is not too difficult though. We can just call Python back from R. To achieve this we simply use the decorator rternalize.

# Define a python function, and make
# it a function R can use through `rternalize`
from rpy2.rinterface import rternalize
@rternalize
def mean_np(x):
    import numpy
    return numpy.mean(x)

# Bind that function to a symbol in R's
# global environment
from rpy2.robjects import globalenv
globalenv['mean_np'] = mean_np

# Write a dplyr chain of operations,
# using our Python function `mean_np`
dataf = (DataFrame(mtcars) >>
         filter('gear>3') >>
         mutate(powertoweight='hp*36/wt') >>
         group_by('gear') >>
         summarize(mean_ptw='mean(powertoweight)',
                   mean_np_ptw='mean_np(powertoweight)'))

dataf
DataFrame with 2 rows and 3 columns:
gear mean_np_ptw mean_ptw
0 1 4.0 1237.126649980317 1237.1266499803169
1 2 5.0 2574.0331639315023 2574.0331639315027

It is also possible to carry this out without having to place the custom function in R’s global environment.

del(globalenv['mean_np'])
from rpy2.robjects.lib.dplyr import StringInEnv
from rpy2.robjects import Environment
my_env = Environment()
my_env['mean_np'] = mean_np

dataf = (DataFrame(mtcars) >>
         filter('gear>3') >>
         mutate(powertoweight='hp*36/wt') >>
         group_by('gear') >>
         summarize(mean_ptw='mean(powertoweight)',
                   mean_np_ptw=StringInEnv('mean_np(powertoweight)',
                                           my_env)))

dataf
DataFrame with 2 rows and 3 columns:
gear mean_np_ptw mean_ptw
0 1 4.0 1237.126649980317 1237.1266499803169
1 2 5.0 2574.0331639315023 2574.0331639315027

note: rpy2’s interface to dplyr is implementing a fix to the (non-?)issue 1323 (https://github.com/hadley/dplyr/issues/1323)

The seamless translation of transformations to SQL whenever the data are in a table can be used directly. Since we are lifting the original implementation of dplyr, it just works.

from rpy2.robjects.lib.dplyr import dplyr
# in-memory SQLite database broken in dplyr's src_sqlite
# db = dplyr.src_sqlite(":memory:")
import tempfile
with tempfile.NamedTemporaryFile() as db_fh:
    db = dplyr.src_sqlite(db_fh.name)
    # copy the table to that database
    dataf_db = DataFrame(mtcars).copy_to(db, name="mtcars")
    res = (dataf_db >>
           filter('gear>3') >>
           mutate(powertoweight='hp*36/wt') >>
           group_by('gear') >>
           summarize(mean_ptw='mean(powertoweight)'))
    print(res)
#
Source: sqlite 3.8.6 [/tmp/tmp1zb7gwm3]
From: <derived table> [?? x 2]

    gear mean_ptw
   (dbl)    (dbl)
1      4 1237.127
2      5 2574.033
..   ...      ...

Since we are manipulating R objects, anything available to R is also available to us. If we want to see the SQL code generated that’s:

print(res.rx2("query")["sql"])
<SQL> SELECT "gear", "mean_ptw"
FROM (SELECT "gear", AVG("powertoweight") AS "mean_ptw"
FROM (SELECT "mpg", "cyl", "disp", "hp", "drat", "wt", "qsec", "vs", "am", "gear", "carb", "hp" * 36.0 / "wt" AS "powertoweight"
FROM "mtcars"
WHERE "gear" > 3.0) AS "zzz1"
GROUP BY "gear") AS "zzz2"

And if the starting point is a pandas data frame, do the following and start over again.

from rpy2.robjects import pandas2ri
from rpy2.robjects import default_converter
from rpy2.robjects.conversion import localconverter
with localconverter(default_converter + pandas2ri.converter) as cv:
    mtcars = mtcars_env['mtcars']
    mtcars = pandas2ri.ri2py(mtcars)
print(type(mtcars))
<class 'pandas.core.frame.DataFrame'>

Using a local converter lets us also go from the pandas data frame to our dplyr-augmented R data frame.

with localconverter(default_converter + pandas2ri.converter) as cv:
    dataf = (DataFrame(mtcars).
             filter('gear>=3').
             mutate(powertoweight='hp*36/wt').
             group_by('gear').
             summarize(mean_ptw='mean(powertoweight)'))

dataf
DataFrame with 3 rows and 2 columns:
gear mean_ptw
0 1 3.0 1633.989574118287
1 2 4.0 1237.1266499803169
2 3 5.0 2574.0331639315027

Reuse. Get things done. Don’t reimplement.

tidyr

Note

This section is available as a jupyter notebook tidyr.ipynb (HTML render: tidyr.html)

from functools import partial
from rpy2.ipython import html
html.html_rdataframe=partial(html.html_rdataframe, table_class="docutils")

tidyr in Python

from rpy2.robjects.lib.tidyr import DataFrame

(note: dplyr is implicitly used by tidyr.)

In addition to that, and because this tutorial is in a notebook, we initialize HTML rendering for R objects (pretty display of R data frames).

import rpy2.ipython.html
rpy2.ipython.html.init_printing()
from collections import OrderedDict
from rpy2.robjects.vectors import (StrVector,
                                   IntVector)
dataf = DataFrame(OrderedDict(x=StrVector(("a", "b", "b")),
                              y=IntVector((3, 4, 5)),
                          z=IntVector((6, 7, 8))))
dataf
DataFrame with 3 rows and 3 columns:
z x y
0 1 6 a 3
1 2 7 b 4
2 3 8 b 5
dataf.spread('x', 'y')
DataFrame with 3 rows and 3 columns:
z a b
0 1 6 3 NA
1 2 7 NA 4
2 3 8 NA 5

Reuse. Get things done. Don’t reimplement.