Errorbars on barcharts


An errorbar is a graphical indication of the standard deviation of a measurement. Or a confidence interval.

The mean of a measurement is something. We want to illustrate how confident we are of that value.

How do we plot that?

I am going to use three libraries:


ggplot2 for plotting, dplyr to make manipulating data easier. And tibble to support a single line later.

Then, we need some data.

data <- data.frame(

I now have a dataframe with three variables, name, mean and standard deviation (sd).

How do we plot that?


data %>% 
  ggplot() +
  geom_bar( aes(x=name, y=mean), stat="identity", fill="skyblue") +
  geom_errorbar(aes(x=name, ymin=mean-sd, ymax = mean+sd), width=0.1)

plot of chunk unnamed-chunk-12

So. What on earth was that?

The pipe-operator %>% takes whatever is on the left-hand-side, and inserts it as the first variable in whatever is on the right-hand-side.

What is on the right-hand-side is the ggplot function. That would normally have a datat=something as the first variable. Here that is the data we constructed earlier.

To that initial plot, which is completely empty, we add a geom_bar. That plots the bars on the plot. It takes an x-value, name, and a y-value, mean. And we tell the function, that rather than counting the number of observations of each x-value (the default behavior of geom_bar), it should use the y-values provided. We also want a nice lightblue color for the bars.

To that bar-chart, we now add errorbars. geom_errorbar needs to know the x- and y-values of the bars, in order to place them correctly. It also needs to know where to place the upper errorbar, and the lower errorbar. And we supply the information that ymin, the lower, should be the mean value minus the standard deviation. And the upper bar, ymax, the sum of the mean and sd. Finally we need to decide how broad those lines should be. We do that by writing “width=0.1”. We do not actually have to, but the default value results in an ugly plot.

And there you go, a barchart with errorbars!

Next step

That was all very nice. However! We do not usually have a nice dataframe with means and standard deviations calculated directly. More often, we have a dataframe like this:

mtcars %>% 
  remove_rownames() %>% 
  select(cyl, disp) %>% 
##   cyl disp
## 1   6  160
## 2   6  160
## 3   4  108
## 4   6  258
## 5   8  360
## 6   6  225

I’ll get back to what the code actually means later.

Here we have 32 observations (only 6 shown above), of the variables “cyl” and “disp”. I would now like to make a barplot of the mean value of disp for each of the three different values or groups in cyl (4,6 and 8). And add the errorbars.

You could scroll through all the data, sort them by cyl, manually count the number of observations in each group, add the disp, divide etc etc etc.

But there is a simpler way:

mtcars %>% 
  remove_rownames() %>% 
  select(cyl, disp) %>% 
  group_by(cyl) %>% 
  summarise(mean=mean(disp), sd=sd(disp))
## # A tibble: 3 x 3
##     cyl  mean    sd
##   <dbl> <dbl> <dbl>
## 1     4  105.  26.9
## 2     6  183.  41.6
## 3     8  353.  67.8

mtcars is a build-in dataset (cars in the us in 1974). I send that, using the pipe-operator, to the function remove_rownames, that does exactly that. We don’t need them, and they will just confuse us. That result is then send to the function select, that selects the two columns/variables cyl and disp, and discards the rest. Next, we group the data according to the value of cyl. There are three different values, 4, 6 and 8. And then we use the summarise function, to calculate the mean and the standard deviation of disp, for each of the three groups.

Now we should be ready to plot. We just send the result above to the plot function from before:

mtcars %>% 
  remove_rownames() %>% 
  select(cyl, disp) %>% 
  group_by(cyl) %>% 
  summarise(mean=mean(disp), sd=sd(disp)) %>% 
  ggplot() +
    geom_bar( aes(x=cyl, y=mean), stat="identity", fill="skyblue") +
    geom_errorbar(aes(x=cyl, ymin=mean-sd, ymax = mean+sd), width=0.1)

plot of chunk unnamed-chunk-15
All we need to remember is to change “name” in the original to “cyl”. All done!

But wait! There is more!!

Those errorbars can be shown in more than one way.

Let us start by saving our means and sds in a dataframe:

data <- mtcars %>% 
  remove_rownames() %>% 
  select(cyl, disp) %>% 
  group_by(cyl) %>% 
  summarise(mean=mean(disp), sd=sd(disp))

geom_crossbar results in this:

data %>% 
ggplot() +
  geom_bar( aes(x=cyl, y=mean), stat="identity", fill="skyblue") +
  geom_crossbar( aes(x=cyl, y=mean, ymin=mean-sd, ymax=mean+sd))

plot of chunk unnamed-chunk-17

I think it is ugly. But whatever floats your boat.

Then there is just a vertival bar, geom_linerange. I think it makes it a bit more difficult to compare the errorbars. On the other hand, it results in a plot that is a bit more clean:

data %>% ggplot() +
  geom_bar( aes(x=cyl, y=mean), stat="identity", fill="skyblue") +
  geom_linerange( aes(x=cyl, ymin=mean-sd, ymax=mean+sd))

plot of chunk unnamed-chunk-18

And here is geom_pointrange. The mean is shown as a point. This probably works best without the bars.

data %>% ggplot() +
  geom_bar( aes(x=cyl, y=mean), stat="identity", fill="skyblue", alpha=0.5) +
  geom_pointrange( aes(x=cyl, y=mean, ymin=mean-sd, ymax=mean+sd))

plot of chunk unnamed-chunk-19

Project Euler 5 – Smallest multiple

What is the smallest, positive, number that can be divided by all numbers from 1 to 20 without any remainder?

We are given that 2520 is the smallest that can be divided by all numbers from 1:10.

One number that can definitely be divided by all numbers from 1:20 is:

## [1] 2.432902e+18

But given that

## [1] 3628800

is rather larger than 2520, it is definitely not the answer.

The answer must be a multiple of all the primes smaller than 20. A number that is divisible by 15, will be divisible by
3 and 5.

The library “numbers” have a lot of useful functions. Primes(20) returns all primes smaller than 20, and prod() returns the product of all those primes

## [1] 9699690

Could that be the answer?

What we are looking at is the modulo-operator. 9699690 modulo 2 – what is the remainder? We know that all the remainders, dividing by 1 to 20 must be 0.

prod(Primes(20)) %% 2
## [1] 0

And our large product is divisible by 2 without a remainder.

Thankfully the operator is vectorized, so we can do all the divisions in one go:

9699690 %% 1:20
##  [1]  0  0  0  2  0  0  0  2  3  0  0  6  0  0  0 10  0 12  0 10


9699690 %% 4
## [1] 2

Leaves a remainder.

(2*9699690) %% 4
## [1] 0

Now I just need to find the number to multiply 9699690 with, in order for all the divisions to have a remainder of 0.
That is, change i in this code until the answer is true.

i <- 2
all((i*9699690) %% 1:20 == 0)
## [1] FALSE

Starting with 1*9699690, I test if all the remainders of the divisions by all numbers from 1 to 20 is zero.
As long as they are not, I increase i by 1, save i*9699690 as the answer, and test again.
If the test is TRUE, that is all the remainders are 0, the while-loop quits, and I have the answer.

i <- 1
while(!all((i*9699690) %% 1:20 == 0)){
 i <- i + 1
 answer <- i*9699690

Weird behavior of is.numeric()

An interesting observation. Please note that I have absolutely no idea about why this happens. (at least not at the time of first writing this)

In our datalab, ingeniuously named “Datalab” (the one at KUB-North, because contrary to the labs at the libraries for social sciences, and for the humanities, we are not allowed to have a name), we were visited by at student.

She wanted to make a dose-response plot. Something about the concentration of something, giving some clotting of some blood. Or something…

Anyway, she wanted to do a 4-parameter logistic model. Thats nice, I had never heard about that before, but that is the adventure of running a datalab, and what makes it fun.

Of course there is a package for it, dr4pl. After an introduction to the wonderful world of dplyr, we set out to actually fit the model. This is a minimal working example of what happened:


data <- tibble(dose= 1:10, response = 2:11)

dr4pl(data, dose, response)
## Error in dr4pl.default(dose = dose, response = response, init.parm = init.parm, : Both doses and responses should be numeric.

WTF? “Both doses and responses should be numeric”?. But they are!

## [1] TRUE
## [1] TRUE

The error is thrown by these lines in the source:

if(!is.numeric(dose)||!is.numeric(response)) {
    stop("Both doses and responses should be numeric.")

Lets try:

if(!is.numeric(data$dose)||!is.numeric(data$response)) {
    stop("Both doses and responses should be numeric.")
  } else {
    print("Where did the problem go?")
## [1] "Where did the problem go?"

No idea. Did it disappear? No:

dr4pl(data, dose, response)
## Error in dr4pl.default(dose = dose, response = response, init.parm = init.parm, : Both doses and responses should be numeric.

Looking at the data might give us an idea of the source:

## Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  2 variables:
##  $ dose    : int  1 2 3 4 5 6 7 8 9 10
##  $ response: int  2 3 4 5 6 7 8 9 10 11

Both dose and response are integers. Might that be the problem?

data <- tibble(dose= (1:10)*1.1, response = (2:11)*1.1)
dr4pl(data, dose, response)
## Error in dr4pl.default(dose = dose, response = response, init.parm = init.parm, : Both doses and responses should be numeric.

Nope. Both dose and response are now definitely numeric:

## Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  2 variables:
##  $ dose    : num  1.1 2.2 3.3 4.4 5.5 6.6 7.7 8.8 9.9 11
##  $ response: num  2.2 3.3 4.4 5.5 6.6 7.7 8.8 9.9 11 12.1

But the problem persists.

Might this be the reason?

## # A tibble: 2 x 2
##    dose response
##   <dbl>    <dbl>
## 1   1.1      2.2
## 2   2.2      3.3

It is a tibble. And the variables are reported to be doubles.

But according to the documentation:

“numeric is identical to double (and real)”

That should not be a problem then.

However. In desperation, I tried this:

c <- data %>% %>% 
  dr4pl(dose, response)

And it works!

Why? Absolutely no idea!

Or do i?

The problem is that subsetting a tibble returns a new tibble. Well, subsetting a dataframe returns a dataframe as well?

It does. Unless you subset out a single variable or observation.

In a dataframe, df$var returns a vektor, containing the values of var in df.

If, however, df is a tibble, df$var will return a tibble, with just one variable.

Starlog – diversity

More homework for Star Trek: Inspiring Culture and Technology.

Why is it important to see yourself on television? Why is television an important subject for scholarly study and how does what we watch shape the world we live in?


Scott asks if you think we’re getting closer to realizing the Vulcan philosophy of IDIC (Infinite Diversity in Infinite Combinations) here on Earth. What would it take for that to happen? What would it look like? How might things be different?

The first question. Well, Scott answers it. It is important to see yourself on Star Trek. Because it shows a vision of a future. That has room for people like me.

I know it from myself. Seeing Jadzia Dax kiss her former wife in rejoined was important. It was not a lesbian kiss as such. And yet it was, and as a gay man, that actually meant something. Especially since we had waited for so long to actually see any representation of gay and lesbian characters in Star Trek, a thing Roddenberry had promised would be adressed in season five of TNG.

This makes a difference. We all to a certain degree view ourself through the stories we are told or shown. Small girls (and some boys) imagine themselves to be princesses when they are read fairytales. Grown men (and some women) imagine themselves to be action heroes when they watch Die Hard. And seeing yourself, or someone like you, portrayed positively makes a huge difference.

That is the real true promise of Star Trek. That the diversity we have on Earth today – not always ideal, will live on in the future, in a more positive and meaningful way, than it does for young people. No matter what their circumstances. In that words of Dan Savage, that it will get better.

That Vulcan ideal, Infinite Diversity in Infinite Combinations, is still in the future. But not as far off, as it has been. What it would take to get closer? I could come with a lot of politically correct suggestions about educating the ignorant, and fighting racism. But I think the underlying problem is scarcity, and a clash of cultures. Make sure that immigration does not mean that I have to pay more in taxes to support unemployed immigrants, and that they do not threaten my livelyhood by putting pressure on wages. And make sure that people of all cultures, accept the basic foundation of a liberal (in the original sense, not necessarily the american political sense) society, democracy, equality of the sexes, and non-discrimination, I think we will be all right. Not that we won’t still have idiots discriminating. But we should try to move to a point where idiots cannot get away with justifying their discrimination with religion and/or culture.

If we can do that, and despite bumps on the road, we are getting ever closer, we will be able to realise the alien ideal of IDIC.

Maybe that also explains why television makes sense as a subject of scholarly study. Television is one of the most important common representations of popular culture. We are placed in front of this entertainment for an inordinate amount of time every day. It affects a lot of people, in diverse ways. If that should not be an important subject of study, I don’t know what should be.

And that gives me the rank I would really like. Commander:

Where to see Great Pandas

Zoos with Great Pandas in their exhibitions, as pr. medio March ’19.

And Copenhagen Zoo, which will get their pandas in april.

Corresponding value to a max-value

One of our users need to find the max-value of a variable. He also needs to find the corresponding value in another variable.
As in – the maximum value in column A is in row 42. What is the value in column B, row 42.

And of course we need to do it for several groups.

Let us begin by making a dataset. Four groups in id,

id <- 1:3
val <- c(10,20)
kor <- c("a", "b", "c")

example <- expand.grid(id,val) %>% 
  as_tibble() %>% 
  arrange(Var1) %>% 
  cbind(kor, stringsAsFactors=F) %>% 
  rename(group=Var1, value=Var2, corr = kor)

##   group value corr
## 1     1    10    a
## 2     1    20    b
## 3     2    10    c
## 4     2    20    a
## 5     3    10    b
## 6     3    20    c

We have six observations, divided into three groups. They all have a value, and a letter in “corr” that is the corresponding value we are interested in.

So. In group 1 we should find the maximum value 20, and the corresponding value “b”.
In group 2 the max value is stil 20, but the corresponding value we are looking for is “a”.
And in group 3 the max value is yet again 20, but the corresponding value is now “c”.

How to do that?

example %>%
  group_by(group) %>% 
  mutate(max=max(value)) %>% 
  mutate(max_corr=corr[(value==max)]) %>% 
## # A tibble: 6 x 5
##   group value corr    max max_corr
##   <int> <dbl> <chr> <dbl> <chr>   
## 1     1   10. a       20. b       
## 2     1   20. b       20. b       
## 3     2   10. c       20. a       
## 4     2   20. a       20. a       
## 5     3   10. b       20. c       
## 6     3   20. c       20. c

The maximum value for all groups is 20. And the corresponding value to that in the groups is b, a and c respectively.

Isn't there an easier solution using summarise function? Probably. But our user needs to do this for a lot of variables. And their names have nothing in common.

Digital Natives

One can only hope that the concept “Digital Natives” will soon be laid to rest. Or at least all the ideas about what they can do.

A digital native is a person that grows up in the digital age, in contrast to digital immigrants, that got their familiarity with digital systems as an adult.

And there are differences. Digital natives assumes that everything is online. Stuff that is not online does not exist. Their first instinct is digital.

However, in the library world, and a lot of other places, the idea has been, that digital natives, because they have never experienced a world without computers, groks them. That they just know how to use them, and how to use them in a responsible and effective way.

That is, with a technical term, bovine feces. And for far too long, libraries (and others) have ignored the real needs, assuming that there was now suddenly no need for instruction in IT-related issues. Becase digital natives.

Being a digital native does not mean that you know how to code.

Being a digital native does not mean that you know how to google efficiently.

Being a digital native does not mean that you are magically endowed with the ability to discern fake news from facts.

I my self is a car native. I have grown up in an age where cars were ubiquitous. And I still had to take the test twice before getting my license. I was not able to drive a car safely, just because I have never known a world without cars. Why do we assume that a digital native should be able to use a computer efficiently?

The next project

For many years, from 1977 to 2006, there was a regular feature in the journal for the Danish Chemical Society. “Kemiske småforsøg”, or “Small chemical experiments”. It was edited by the founder of the Danish Society for Historical Chemistry, and contained a lot of interesting chemistry, some of it with a historical angle.

The Danish Society for Historical Chemistry is considering collecting these experiments, and publishing them. It has been done before, but more experiments were published after that.

We still don’t know if we will be allowed to do it. And it is a pretty daunting task, as there are several hundred experiments. But that is what I’m spending my free time on at the moment. If we get i published, it will be for sale at the website of the Danish Society for Historical Chemistry.

Project Euler 39

Project Euler 39

We’re looking at Pythagorean triplets, that is equations where a, b and c are integers, and:

a2 + b2 = c2

The triangle defined by a,b,c has a perimeter.

The triplet 20,48,52 fulfills the equation, 202 + 482 = 522. And the perimeter of the triangle is 20 + 48 + 52 = 120

Which perimeter p, smaller than 1000, has the most solutions?

So, we have two equations:

a2 + b2 = c2

p = a + b + c

We can write

c = p – a – b

And substitute that into the first equation:

a2 + b2 = (p – a -b)2

Expanding the paranthesis:

a2 + b2 = p2 – ap – bp – ap + a2 + ab – bp + ab + b2


0 = p2 – 2ap – 2bp + 2ab

Isolating b:

0 = p2 – 2ap – b(2p – 2a)

b(2p – 2a) = p2 – 2ap

b = (p2 – 2ap)/(2p – 2a)

So. For a given value of p, we can run through all possible values of a and get b. If b is integer, we have a solution that satisfies the constraints.

The smallest value of a we need to check is 1. But what is the largest value of a for a given value of p?

We can see from the pythagorean equation, that a =< b < c. a might be larger than b, but we can then just switch a and b. So it holds. What follows from that, is that a =< p/3.

What else? If a and b are both even, a2 and b2 are also even, then c2 is even, and then c is even, and therefore p = a + b + c is also even.

If a and b are both uneven, a2 and b2 are also uneven, and c2 is then even. c is then even. And therefore p = a + b + c must be even.

If either a or b are uneven, either a2 or b2 is uneven. Then c2 is uneven, and c is then uneven. Therefore p = a + b + c must be even.

So. I only need to check even values of p. That halves the number of values to check.

Allright, time to write some code:

current_best_number_of_solutions <- 0

for(p in seq(2,1000,by=2)){
  solutions_for_current_p <- 0
  for(a in 1:ceiling(p/3)){
      solutions_for_current_p <- solutions_for_current_p + 1
  if(solutions_for_current_p > current_best_number_of_solutions){
    current_best_p <- p
    current_best_number_of_solutions <- solutions_for_current_p

answer <- current_best_p

current_best_number_of_solutions is initialized to 0.

For every p from 2 to 1000, in steps of 2 (only checking even values of p), I set the number of solutions_for_current_p to 0.

For every value a from 1 to p/3 – rounded to to an integer: If !(p2-2*a*p)%%(2*p-2*a) is true, that is, if the remainder of (p2-2*a*p)/(2*p-2*a) is 0, I increment the solutions_for_current_p.

After running through all possible values of a for the value of p we have reached in the for-loop:

If the number of solutions for this value of p is larger, than the previous current_best_number_of_solutions, we have found a value of p that has a higher number of solutions than any previous value of p we have examined. In that case, set the current_best_p to the current value of p. And the current_best_number_of_solutions to the number of solutions we have found for the value of p.

If not, dont change anything, reset solutions_for_current_p and check a new value of p.

Project Euler 4

A palindromic number is similar to a palindrome. It is the same read both left to right, and right to left.

Project Euler tells us, that the largest palindrom made from the product of two 2-digit numbers is 9009. That number is made by multiplying 91 and 99.

I must now find the largest palindrome, made from the product of two 3-digit numbers.

What is given, is that the three digit numbers cannot end with a zero.

There are probably other restrictions as well.

I’ll need a function that tests if a given number is palindromic.

palindromic <- function(x){
  sapply(x, function(x) (str_c(rev(unlist(str_split(as.character(x),""))), collapse="")==as.character(x)))

The function part converts x to character, splits it in individual characters, unlists the result, reverses that, and concatenates it to a string. Then it is compared to the original x – converted to a character.
The sapply part kinda vectorises it. But it is still the slow part.

If I could pare the number of numbers down, that would be nice.

One way would be to compare the first and last digits in the number.

first_last <- function(x) { 
  x %/% 10^(floor(log10(x))) == x%%10

This function finds the number of digits – 1 in x. I then modulo-divide the number by 10 to the number of digits minus 1. That gives me the first digit, that I compare with the last. If the first and the last digit is the same – it returns true.

Now I am ready. Generate a vector of all three-digit numbers from 101 to 999. Expand the grid to get all combinations. Convert to a tibble,
filter out all the three-digit numbers that end with 0. Calculate a new column as the multiplication of the two numbers, filter out all the results where the first and last digit are not identical, and then filter out the results that are not palindromic. Finally, pass it to max (using %$% to access the individual variables), and get the result.


res <- 101:999 %>% 
  expand.grid(.,.) %>% 
  as_tibble() %>% 
  filter(Var1 %% 10 != 0, Var2 %% 10 != 10) %>% 
  mutate(pal = Var1 * Var2) %>% 
  filter(first_last(pal)) %>% 
  filter(palindromic(pal)) %$% 

There are probably faster ways of doing this…