AWS Athena dplyr unload

RBloggers|RBloggers-feedburner

Intro:

This is a quick update on the latest features for RAthena and noctua 2.4.0.

Latest features:

dbplyr:

RAthena and noctua now fully supports dbplyr backend api 2+. dplyr database generics will be deprecated in later versions of the dbplyr package development. This is to future proof RAthena and noctua, while keeping the same functionality developed for dbplyr backend api version 1.

dplyr and unload:

RAthena and noctua can now set AWS Athena Unload on a session level. This enables dplyr syntax to leverage AWS Athena unload without any extra code.

Note: Set up AWS Athena example table using AWS Athena awswrangler example:

# Python
import awswrangler as wr

import getpass
bucket = getpass.getpass()
path = f"s3://{bucket}/data/"

if "awswrangler_test" not in wr.catalog.databases().values:
    wr.catalog.create_database("awswrangler_test")

cols = ["id", "dt", "element", "value", "m_flag", "q_flag", "s_flag", "obs_time"]

df = wr.s3.read_csv(
    path="s3://noaa-ghcn-pds/csv/189",
    names=cols,
    parse_dates=["dt", "obs_time"])  # Read 10 files from the 1890 decade (~1GB)

wr.s3.to_parquet(
    df=df,
    path=path,
    dataset=True,
    mode="overwrite",
    database="awswrangler_test",
    table="noaa"
);

wr.catalog.table(database="awswrangler_test", table="noaa")

Set up connection to AWS Athena.

library(DBI)
library(RAthena) # or library(noctua)
library(dplyr, warn.conflicts = F)

con <- dbConnect(athena())

Query AWS Athena table without using AWS Athena unload method.

noaa_tbl = tbl(con, dbplyr::in_schema("awswrangler_test","noaa"))

system.time({noaa = collect(noaa_tbl)})
#> Info: (Data scanned: 80.86 MB)
#>   user  system elapsed 
#> 67.144   7.353 126.914 

dim(noaa)
#> 29554220 8

Query AWS Athena table using AWS Athena unload method.

RAthena_options(unload = TRUE) # or noctua_options(unload = TRUE) for noctua

system.time({noaa = collect(noaa_tbl)})
#> Info: (Data scanned: 80.86 MB)
#>   user  system elapsed 
#> 11.062   2.317  32.078 

dim(noaa)
#> 29554220 8

Query AWS Athena table using AWS Athena unload method, while caching.

RAthena_options(cache_size = 10, unload = TRUE) # or noctua_options(cache_size = 10, unload = TRUE) for noctua
system.time({noaa = collect(noaa_tbl)})
#> Info: (Data scanned: 80.86 MB)
#>   user  system elapsed 
#> 10.768   2.541  12.305 

dim(noaa)
#> 29554220 8

Benchmark ran on AWS Sagemaker

NOTE: Cache speeds will only benefit repeat queries!

AWS Athena Unload method doesn’t work on every sql query type. As a result, several dplyr methods won’t work when unload is set to True, for example:

noaa_tbl %>% filter(element == "TMAX") %>% compute()

#> Error: Unable to create table when `RAthena_options(unload = TRUE)`. Please run `RAthena_options(unload = FALSE)` and try again.

In general AWS Athena unload method is a nice way to speed up some queries when working with RAthena and noctua when the data returning is large.

Finally

If there is any new features or bug fixes please raise them at RAthena issues or noctua issues.