ggmail + forecast = how many emails I will get tomorrow?

At eRum 2016, Adam Zagdański gave a very good tutorial on time series modeling. Among other things, I learned that the forecast package (created by Rob Hyndman) has gained cool new plots based on the ggplot2 package.

Let’s use it to play with mailbox statistics for my Gmail account!

1. Get the data

Follow this link to download the data from your Gmail account as a single mbox file.
It may be large (15 GB in my case), but for the next steps it is enough to keep only the headers. grep + cat will do the job.
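If you would rather stay in R than use grep, the same filtering can be sketched as below. This is only an illustration, not the command used for this post: the helper name keep_date_headers() and the tiny demo mbox are made up, and the code assumes that lines starting with "Date: " are the headers of interest. A body line that happens to start the same way (for example in a forwarded message) would be kept too.

```r
# Hypothetical helper: copy only the "Date: " lines from a large mbox file
# to a small text file, reading in chunks so the whole file never sits in memory.
keep_date_headers <- function(mbox_path, out_path, chunk_size = 100000L) {
  input <- file(mbox_path, open = "r")
  output <- file(out_path, open = "w")
  on.exit({
    close(input)
    close(output)
  })
  kept <- 0L
  repeat {
    chunk <- readLines(input, n = chunk_size, warn = FALSE)
    if (length(chunk) == 0L) break
    # useBytes = TRUE avoids errors on oddly encoded message bodies
    hits <- grep("^Date: ", chunk, value = TRUE, useBytes = TRUE)
    writeLines(hits, output, useBytes = TRUE)
    kept <- kept + length(hits)
  }
  kept
}

# Tiny made-up mbox file, just to show the helper in action
demo_mbox <- tempfile(fileext = ".mbox")
writeLines(c(
  "From 123@xxx Mon Oct 17 08:15:02 2016",
  "Date: Mon, 17 Oct 2016 10:15:02 +0200",
  "Subject: hello",
  "",
  "Body text with Date: inside the line, not a header",
  "From 124@xxx Tue Oct 18 09:00:00 2016",
  "Date: Tue, 18 Oct 2016 11:00:00 +0200",
  "Subject: again"
), demo_mbox)

demo_out <- tempfile(fileext = ".txt")
n_kept <- keep_date_headers(demo_mbox, demo_out)
```

Reading in chunks keeps memory use flat, which matters for a file of this size.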

2. Read headers

The readLines() function can read the header lines. The lubridate package then helps to extract the dates and convert them to R date objects.
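A minimal sketch of this step, under a few assumptions: the header lines below are made up, the file name in the comment is hypothetical, and the dates follow the usual "Mon, 17 Oct 2016 10:15:02 +0200" layout. Here base R's format string does the parsing and lubridate supplies the date helpers; this is one possible split of the work, not necessarily the one used for the post.

```r
library(lubridate)

# Made-up header lines; in practice something like
# header_lines <- readLines("headers.txt")  # hypothetical file name
header_lines <- c(
  "Date: Mon, 17 Oct 2016 10:15:02 +0200",
  "Date: Tue, 18 Oct 2016 11:00:00 +0200",
  "Date: Tue, 18 Oct 2016 23:30:00 -0500"
)

# Strip the "Date: " prefix and an optional leading weekday such as "Mon, "
date_text <- sub("^Date: +", "", header_lines)
date_text <- sub("^[A-Za-z]{3}, +", "", date_text)

# %z reads the numeric UTC offset; %b assumes English month names (LC_TIME)
sent_at <- as.POSIXct(date_text, format = "%d %b %Y %H:%M:%S %z", tz = "UTC")

# lubridate helpers for the pieces needed in later steps
sent_day <- as_date(sent_at)                 # calendar day, here in UTC
sent_wday <- wday(sent_at, week_start = 1)   # 1 = Monday, ..., 7 = Sunday
```

Everything is converted to UTC here, so the third message lands on the next calendar day. Whether to count days in UTC or in local time is a choice the sketch makes for simplicity.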

3. Basic gg-exploration

I started with daily aggregates: the number of emails per day.
The ts() function converts a vector of aggregates to a time series object.
Then I used the autoplot() function to plot the time series. Since it is a ggplot2 plot, you can easily add a smooth trend with the geom_smooth() function.
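The three steps above can be sketched as follows. The data are simulated, since the real mailbox is not available here, and frequency = 7 (one week as the seasonal period) is an assumption rather than the setting used for the original plots.

```r
library(ggplot2)
library(forecast)

set.seed(2016)
days <- seq(as.Date("2015-01-01"), as.Date("2016-12-31"), by = "day")

# Simulated stand-in for the real data: one sending date per email
sent_dates <- sample(days, size = 50000, replace = TRUE)

# Count emails per day; the factor levels keep days with zero emails
daily_counts <- as.vector(
  table(factor(as.character(sent_dates), levels = as.character(days)))
)

# Assumption: a weekly seasonal period for daily data
emails_ts <- ts(daily_counts, frequency = 7)

# autoplot() returns a ggplot2 object, so further layers can be added
p_daily <- autoplot(emails_ts) +
  geom_smooth(se = FALSE) +
  labs(x = "week", y = "emails per day")
print(p_daily)
```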


There is some trend, but what about seasonality?
The geom_boxplot() function is useful for checking whether there are differences among days of the week or months.
It turns out that the number of emails per day is very different for weekdays and weekends.
Also, August is the email-lightest month, with only 60 per day on average ;-)
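A sketch of such boxplots is shown below. The daily counts are simulated with an artificial weekday/weekend gap, so the numbers say nothing about the real mailbox. The column names are hypothetical.

```r
library(ggplot2)
library(lubridate)

set.seed(2016)
days <- seq(as.Date("2015-01-01"), as.Date("2016-12-31"), by = "day")

# With week_start = 1, Monday is 1 and Sunday is 7
is_weekend <- wday(days, week_start = 1) >= 6

# Simulated counts with fewer emails on weekends (made-up rates)
daily <- data.frame(
  day = days,
  emails = rpois(length(days), lambda = ifelse(is_weekend, 30, 100))
)
daily$weekday <- wday(daily$day, label = TRUE, week_start = 1)
daily$month <- month(daily$day, label = TRUE)

# One box per day of the week, then one box per month
p_weekday <- ggplot(daily, aes(x = weekday, y = emails)) + geom_boxplot()
p_month <- ggplot(daily, aes(x = month, y = emails)) + geom_boxplot()
print(p_weekday)
print(p_month)
```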



4. Time Series

The decompose() + autoplot() functions extract the trend and seasonal components from the time series. A multiplicative seasonal component is probably more appropriate here, but the additive one is presented below since its values are easier to read off the y axis.
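To illustrate the difference between the two variants, here is a small sketch on simulated data with a made-up weekly pattern and a weekly period (both assumptions). The additive seasonal component is expressed in emails per day and is centered around zero, while the multiplicative one is a factor centered around one, which is why the additive plot is easier to read.

```r
library(ggplot2)
library(forecast)

set.seed(2016)
n_weeks <- 104
weekly_pattern <- c(100, 105, 110, 105, 95, 30, 25)  # Mon..Sun, made up
trend <- seq(0, 40, length.out = n_weeks * 7)

emails_ts <- ts(
  rep(weekly_pattern, n_weeks) + trend + rnorm(n_weeks * 7, sd = 5),
  frequency = 7
)

# Same series, two decompositions
additive <- decompose(emails_ts, type = "additive")
multiplicative <- decompose(emails_ts, type = "multiplicative")

# Seasonal effect per position in the week
round(additive$figure, 1)        # in emails per day, relative to the trend
round(multiplicative$figure, 2)  # as a factor applied to the trend

p_decomp <- autoplot(additive)
print(p_decomp)
```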


A lot of models can be fitted with the forecast package. Of the different choices, the scariest one is the forecast from the Holt method. Scary because of the trend.
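Holt's linear method extrapolates the last estimated level and slope in a straight line, which is why an upward trend in the data carries straight into the forecast. A minimal sketch on simulated data (the series, the 14-day horizon and the weekly period are all assumptions, not the settings behind the original plot):

```r
library(ggplot2)
library(forecast)

set.seed(2016)
n_days <- 365

# Simulated daily counts with a steady upward drift (made-up numbers)
emails_ts <- ts(80 + 0.1 * seq_len(n_days) + rnorm(n_days, sd = 10), frequency = 7)

# Holt's linear trend method, forecasting two weeks ahead
holt_fc <- holt(emails_ts, h = 14)

# Point forecasts change by the same amount at every step
forecast_steps <- diff(as.numeric(holt_fc$mean))

p_holt <- autoplot(holt_fc) + labs(x = "week", y = "emails per day")
print(p_holt)
```

Passing damped = TRUE to holt() flattens the trend over the horizon, which may give a less alarming picture.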

