Skip to content

subplot: problems with stacked histograms bins and legends when few datapoints #1456

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jumanbar opened this issue Jan 18, 2019 · 4 comments
Closed

Comments

@jumanbar
Copy link

jumanbar commented Jan 18, 2019

What I want to do

Hello! I'm currently creating a shiny app using plotly to show histograms of my data. My intention is to emulate the ggplot2 function facet_grid with stacked histograms. With enough data, the code seems to work correctly, but when the data is scarce, some strange things start to happen. Specifically, the size of the bins gets distorted and the legend has fewer categories than what it should.

Here's a quick comparison of 3 outputs:

  • Left and middle: made with ggplot and plotly respectively, using the same 50 rows of data in both cases. The latter has the aforementioned problems.
  • Right: made with plotly (using the same code as the middle figure), using 1000 rows of data. This one seems OK.

Left: ggplot2 with 50 datapoints; middle: plotly with 50 datapoints ; right: plotly with 1000 datapoints

Full-res image here.

Reprex

(not made with reprex package)

The following code sets up the R session and the data, including the amount of datapoints needed to recreate the figures (change to size = 1000 to get the one on the right):

library(ggplot2)
library(dplyr)
library(magrittr)
library(plotly)
library(scales)
library(grDevices)

set.seed(0)
di <- diamonds %>%
  sample_n(size = 50)

My goal is to reproduce plot on the left (created with the code below), w/o using the ggplotly function (mainly because of this).

p <- ggplot(di) +
  aes(price, fill = cut) +
  facet_grid(color ~ .) +
  geom_histogram() +
  scale_x_log10() +
  ggtitle("Made with ggplot2") +
  xlab("Price (US dollars)")
print(p)

ggplot figure

Manipulating the data for plot_ly

First I need to manually set the breaks and counts for the histogram bins (the type = "log" option for the xaxis has some problems). To do this I use the hist and cut functions:

# With hist I set the breaks and midpoints of said breaks:
h <- hist(di$price %>% log10, breaks = 30, plot = FALSE)

# Using cut I get a factor with all the bins:
intervals <- cut(log10(di$price), breaks = h$breaks)
midpoints <- intervals
levels(midpoints) <- h$mids

# This are the x values used in the final plots:
midpoints <- midpoints %>%
  as.character %>%
  as.numeric

Finally I calculate de counts for all the bins, using all combinations of the columns cut, color and midpoints (note that I group by intervals too, which coincide with the grouping made with midpoints; the goal is to have the intervals for hovermode info):

plot_data <- di %>%
  mutate(mids = midpoints,
         ints = intervals) %>%
  count(cut, color, mids, ints)

Setting up the plotting parameters

Here I set up the plot_ly parameters. First I set a common range for the y axis (yrange):

maxCount <- plot_data %>%
  group_by(color, mids) %>%
  summarise(N = sum(n, na.rm=TRUE)) %>%
  (function(x) max(x$N))
yrange <- c(0, maxCount) # The range for all the y axis

Then I set up both axis, which are lists. The first part involves calculating the breaks for x axis, using functions from the scales package (log10_trans and number):

plot_breaks <- log10_trans()$breaks(di$price)

xaxis <- list(
  title = "Price (US dollars)",
  tickvals = log10(plot_breaks),
  ticktext = number(plot_breaks, accuracy = min(plot_breaks)),
  range = range(log10(di$price))
  )
yaxis <- list(
  fixedrange = FALSE,
  range = yrange
  )

Creating the plot objects

In this part I split the data by the column color and then I create a list to store the plots. I also made the plotFun as a wrapper for the plotly commands for the sake of cleanlyness.

pdata_split <- split(plot_data, plot_data[["color"]])
plots <- vector("list", length(pdata_split))

plotFun <- function(i, showlegend = TRUE) {
  yaxis$title <- names(pdata_split)[i]
  pdata_split[[i]] %>%
    plot_ly(
      x = ~mids, y = ~n, color = ~cut,
      legendgroup = ~cut,
      showlegend = showlegend,
      text = ~paste0("Rango: ", ints, "\n",
                     "Conteo: ", n),
      hoverinfo = "text+name") %>%
    add_bars() %>%
    layout(
      hovermode = "compare",
      barmode = "stack",
      bargap = 0,
      xaxis = xaxis,
      yaxis = yaxis
    )
}

Finally I excecute the plot_ly commands using a loop, with an iteration for each "color" category in the dataset:

plots[[1]] <- plotFun(1, TRUE)
for (i in 2:length(pdata_split))
  plots[[i]] <- plotFun(i, FALSE)

sp <- subplot(plots, nrows = i, shareX = TRUE,
              shareY = TRUE, titleY = TRUE)

The (sub)plot:

print(sp)

Plotly made with subplot

UPDATE

Found a hacky solution to this, check my following comment.

@jumanbar
Copy link
Author

jumanbar commented Jan 18, 2019

UPDATE 2:

I found a solution, which is to add the tidyr::complete function into the creation of plot_data. There was still a problem of too much data (32340 rows, instead of 48), because complete adds all the combination of cases. It turns out I was doing it wrong, since the columns mids and ints do not combine in every posible way (mids = middle values of each bin represented in ints). So I changed my code around the part where I create plot_data, so it looks like this:

# Using cut I get a factor with all the bins:
intervals <- cut(log10(di$price), breaks = h$breaks)

# This table is used to join with the data, latter:
mid_int <- tibble(mids = h$mids, 
                  ints = levels(intervals) %>% factor)

plot_data <- di %>%
  mutate(ints = intervals) %>%
  count(cut, color, ints) %>% 
  complete(cut, color, ints, fill = list(n = 0)) %>% 
  left_join(mid_int) %>% # Join to include mids into plot_data
  select(cut, color, ints, mids, n) # Reorder columns

Now plot_data has 1155 rows, not 32340, and therefore the plot reacts faster and looks good. The problem I can imagine is that with many possible combinations it will become somewhat sluggish (for example: adding more fill categories, more breaks or adding columns to the facet plot).

Also, when using the hovermode = "compare" option, it shows data for bars that are not even there (ie: height == 0; this is also the default behaviour of the ggplot object converted with ggplotly).

Showing unnecesary data.

("Conteo" means the count for each bar.)

@cpsievert
Copy link
Collaborator

cpsievert commented Jan 21, 2019

Interesting, thanks for the thorough report!

I think this could be considered a plotly.js bug, but the problem can be fixed by explicitly setting https://plot.ly/r/reference/#bar-width to 1/2 the bin width you desire.

In other words, change

add_bars() %>%

to

add_bars(width = 0.025) %>%

and believe the problem goes away (plus, you won't need the tidyr::complete() hack), right?

@jumanbar
Copy link
Author

jumanbar commented Jan 21, 2019

Hi! I've tried your suggestion and it works for the anomalous bin width, but it messes up the legend:

result

Seems like it has to do with the data split that I made to do the multiple plots: it just uses the legend for the first plot/chunk of data, which has only a sample of the whole set of categories.

Edit: sorry I clicked the 'close issue' button by mistake :s

@jumanbar jumanbar reopened this Jan 21, 2019
@cpsievert
Copy link
Collaborator

Closing since the legend ordering is a separate issue

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants