Plotting in R can be challenging when large datasets live in a database. How do you do it efficiently?
Using data about the worldwide chart success of songs and albums, we create an in-memory database to get going quickly without having to worry much about infrastructure. Efficiency in plotting from databases is largely determined by how much data you need to pull into R’s memory. Many plots only require aggregated data, so doing the calculations inside the database pays off by reducing the amount of data transferred.
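As a minimal sketch of that setup (assuming RSQLite and dbplyr are installed; the charts data frame and the artist column are placeholders for the actual chart data):

```r
library(DBI)
library(dplyr)

# Create an in-memory SQLite database and copy the (hypothetical) charts data frame into it
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = ":memory:")
DBI::dbWriteTable(con, "charts", charts)

# A lazy reference to the database table: nothing is pulled into R yet
charts_db <- tbl(con, "charts")

# Aggregate inside the database; collect() transfers only the summarized rows
top_artists <- charts_db |>
  count(artist, sort = TRUE) |>
  head(10) |>
  collect()
```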
We look at bar plots to compare different approaches: an inefficient one where ggplot2 does the aggregation, manual aggregation inside the database, and a dedicated function from the dbplot package by Edgar Ruiz.
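Here is a hedged sketch of the three bar-plot approaches, reusing the charts_db reference and the placeholder artist column from above:

```r
library(ggplot2)
library(dbplot)

# 1) Inefficient: collect() pulls every row into R, then ggplot2 does the counting
charts_db |>
  collect() |>
  ggplot(aes(x = artist)) +
  geom_bar()

# 2) Better: count inside the database, transfer only the aggregated result
charts_db |>
  count(artist) |>
  collect() |>
  ggplot(aes(x = artist, y = n)) +
  geom_col()

# 3) dbplot: in-database aggregation and plotting in one step, returning a ggplot object
charts_db |>
  dbplot_bar(artist)
```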
Further examples include a histogram, a scatterplot, and a powerful alternative to the scatterplot: a raster plot.
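A sketch of these plot types with dbplot, again with placeholder column names (weeks_on_chart, peak_position) standing in for whatever numeric variables the chart data provides:

```r
# Histogram: binning happens inside the database
charts_db |>
  dbplot_histogram(weeks_on_chart, bins = 20)

# Raster plot: a database-friendly alternative to a scatterplot with many points
charts_db |>
  dbplot_raster(weeks_on_chart, peak_position, resolution = 50)
```

Since dbplot returns ggplot objects, the results can be customized further with the usual + labs(), themes, and scales.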
This is my first video based on a xaringan presentation, which builds on R Markdown, knitr, and the remark.js JavaScript library. Let me know in the comments what you think of this format. I like it because all my R code stays in one place: an R Markdown document from which the presentation is rendered. No copying and pasting of R code, tables, or plots.
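Rendering works like any other R Markdown document; a minimal call (using the file listed below, assuming its YAML header sets a xaringan output format) would be:

```r
# Knit the xaringan slide deck from the R Markdown source
rmarkdown::render("Plotting-from-database.Rmd")
```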
Code: https://github.com/fjodor/dataviz_ideas
File: Plotting-from-database.Rmd
More information on working with databases in R: https://db.rstudio.com/
Ideas for this video are based on "Best Practices - Creating Visualizations"
—
Blog (German, translate option): https://statistik-dresden.de/statistik-blog
R Workshops: https://statistik-dresden.de/r-schulungen
Twitter: https://twitter.com/StatistikInDD
Facebook: https://www.facebook.com/statistikdresden/
Playlist: Music chart history
https://www.youtube.com/playlist?list=PL4ZUlAlk7QidRlzHEiHX09htXMAbxTpjW