Project Euler 5 – Smallest multiple

What is the smallest positive number that can be divided by all numbers from 1 to 20 without any remainder?

We are given that 2520 is the smallest number that can be divided by all numbers from 1 to 10.

One number that can definitely be divided by all numbers from 1 to 20 is:

factorial(20)
## [1] 2.432902e+18

But given that

factorial(10)
## [1] 3628800

is considerably larger than 2520, factorial(20) is certainly not the answer – just an upper bound.

The answer must be a multiple of all the primes smaller than 20. A number that is divisible by 15 will also be divisible by
3 and 5.

The library “numbers” has a lot of useful functions. Primes(20) returns all primes up to 20, and prod() returns the product of all those primes:

library(numbers)
prod(Primes(20))
## [1] 9699690

Could that be the answer?

What we need here is the modulo operator. 9699690 modulo 2 – what is the remainder? We know that all the remainders, dividing by 1 to 20, must be 0.

prod(Primes(20)) %% 2
## [1] 0

And our large product is divisible by 2 without a remainder.

Thankfully the operator is vectorized, so we can do all the divisions in one go:

9699690 %% 1:20
##  [1]  0  0  0  2  0  0  0  2  3  0  0  6  0  0  0 10  0 12  0 10

Nope.

9699690 %% 4
## [1] 2

Leaves a remainder.

(2*9699690) %% 4
## [1] 0

Now I just need to find the number to multiply 9699690 by, in order for all the divisions to have a remainder of 0.
That is, change i in this code until the answer is TRUE.

i <- 2
all((i*9699690) %% 1:20 == 0)
## [1] FALSE

Starting with 1*9699690, I test if all the remainders of the divisions by the numbers from 1 to 20 are zero.
As long as they are not, I increase i by 1, save i*9699690 as the answer, and test again.
If the test is TRUE, that is, all the remainders are 0, the while-loop quits, and I have the answer.

i <- 1
while(!all((i*9699690) %% 1:20 == 0)){
 i <- i + 1
 answer <- i*9699690
}
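As a quick sanity check (assuming the loop above has just run), we can verify that the result really is divisible by everything from 1 to 20:

# should be TRUE if answer is a valid multiple of all numbers from 1 to 20
all(answer %% 1:20 == 0)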

Mann-Whitney-Wilcoxon test

The Mann-Whitney U test, also known as the Mann-Whitney-Wilcoxon test, is a non-parametric test of the null hypothesis that it is equally likely that a randomly selected value from one sample will be less than, or greater than, a randomly selected value from a second sample.

Intuitively I think I can see how this translates to “the two samples are from populations that are actually different”. I have a harder time describing that intuition.

The test is nearly as efficient as a t-test. But the advantage is that it does not require us to assume that the two samples are normally distributed.

It can also be used to test if two independent samples were selected from populations that have the same distribution.

The idea is this: we have two samples. If we pick a random value from sample A and a random value from sample B, there is a likelihood, a chance, that the value from sample A is larger than the value from sample B. Is that likelihood the same as the likelihood that the value from sample A is smaller than the value from sample B?
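To make that intuition a bit more concrete, here is a purely illustrative simulation (the samples, sizes and shift are made up):

set.seed(1)
A <- rnorm(100, mean = 0)    # hypothetical sample A
B <- rnorm(100, mean = 0.5)  # hypothetical sample B, shifted a bit

# compare many random pairs, one value drawn from each sample
a <- sample(A, 10000, replace = TRUE)
b <- sample(B, 10000, replace = TRUE)
mean(a > b)  # estimated probability that the A-value is larger
mean(a < b)  # estimated probability that the B-value is larger

Under the null hypothesis those two probabilities would be roughly equal. Here they are not, because B was shifted.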

How do we run the actual test in R?

Let us look at some data. mtcars is a dataset with some facts about various 1974 US cars.
One of the numbers we have is mpg, miles per gallon, or fuel efficiency. Another is about the transmission: is there a manual or an automatic gearbox? In the column am, a “0” indicates an automatic gearbox and a “1” a manual gearbox.

We can now extract two sets of data on the fuel efficiency, based on the type of transmission:

aut <- mtcars[which(mtcars$am==0),]$mpg
man <- mtcars[which(mtcars$am==1),]$mpg

In “aut” we have the fuel efficiency of cars with an automatic gearbox, and in “man” the same, but for the manual gearbox that is more familiar in Europe.

How to do the Wilcoxon test?

wilcox.test(man,aut)
## Warning in wilcox.test.default(aut, man): cannot compute exact p-value with
## ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  man and aut
## W = 205, p-value = 0.001871
## alternative hypothesis: true location shift is not equal to 0

The p-value is 0.001871. If we take a 0.05 significance level, we can conclude that the two sets of values are from non-identical populations. Had it been larger (quite a bit larger), we would not have been able to reject the null hypothesis that the two samples come from identical populations.

There is a different way to do the test, where we don’t have to split the dataset in two first:

wilcox.test(mpg ~ am, data=mtcars)
## Warning in wilcox.test.default(x = c(21.4, 18.7, 18.1, 14.3, 24.4, 22.8, :
## cannot compute exact p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  mpg by am
## W = 42, p-value = 0.001871
## alternative hypothesis: true location shift is not equal to 0

We tell the function that we want to test mpg (fuel efficiency) as a function of am (the type of transmission), and that the dataset we are working on is mtcars. The function then splits the dataset in two based on the value in am. Had there been more than two groups, things would probably stop working.

The result is the same though.

The warning

We get a warning. That is because the algorithm has a problem when there are identical values in the two datasets. There is a way around it. We can add “jitter” to the two datasets. jitter() adds small random values to the values in the dataset. Often that solves the problem:

wilcox.test(jitter(man),jitter(aut))
## 
##  Wilcoxon rank sum test
## 
## data:  jitter(man) and jitter(aut)
## W = 205, p-value = 0.001214
## alternative hypothesis: true location shift is not equal to 0

On the other hand, we are no longer comparing the values in the dataset. We can see the difference here:

man - jitter(man)
##  [1] -0.051431873 -0.070007257 -0.036621231 -0.047522075 -0.039480170
##  [6]  0.056659217  0.057724536 -0.003120035 -0.033923965 -0.019053550
## [11]  0.028977073  0.001790751  0.045607303

It would probably be a bad idea to add random noise to the data. On the other hand, it is not very likely that two cars have exactly the same fuel efficiency. We have probably “binned” the values, and the addition of random noise would probably not change the values we are working on too much. But you should always consider why the warning arises, whether it is actually a problem and not just a warning, and whether it is appropriate to solve it in this way.


Project Euler 187 – Semiprimes

I had to look it up. Semiprimes are numbers that are the product of two prime numbers. And only two, although they may be equal.

There are ten of them below 30: 4, 6, 9, 10, 14, 15, 21, 22, 25 and 26.

16 is not. The only prime factor is 2, but it occurs four times.
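A quick way to check a single number – a small sketch, assuming a factors() helper that returns the prime factorisation with multiplicity, like the one I use with the numbers library in the later Project Euler posts:

library(numbers)
# a number is a semiprime if its factorisation has exactly two prime factors
is_semiprime <- function(x) length(factors(x)) == 2
is_semiprime(25)
## [1] TRUE
is_semiprime(16)
## [1] FALSE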

How many of these semiprimes are there below 10⁸?

That should be pretty straightforward: generate all primes below 10⁸, make all the multiplications, and count how many unique numbers there are below n, where n = 10⁸.

One problem:

library(numbers)
n <- 10**8
numbers <- Primes(n)
length(numbers)
## [1] 5761455

That is a lot of numbers to multiply.

A trick: 2 times each of the primes below n/2 will give all the semiprimes (smaller than n) that have 2 as one of the prime factors.

3 times each of the primes below n/3 will in the same way give all the semiprimes that have 3 as one of the prime factors.

If I can figure out how many primes there are below n/2, I get the number of semiprimes that have 2 as one of the two prime factors. The same goes for the number of primes below n/3. If I continue that up to \(\sqrt{n}\), and add them all together, I should get the total number of semiprimes below n.

One issue though. Among the primes below n/2 that I multiply by 2 is 3, giving 6. And among the primes below n/3 that I multiply by 3 is 2, also giving 6. I need to figure out how to only count 6 once.

I just generated all the primes below n. The number of primes below n/2 is:

length(numbers[numbers<n/2])
## [1] 3001134

And the number of primes below n/3 is:

length(numbers[numbers<n/3])
## [1] 2050943

I do want to multiply 3 by 3. But I need to exclude 2.

length(numbers[numbers<n/3 & numbers>=3])
## [1] 2050942

Qap’la! One less prime.

I just need to do it for all of them.

n_semi_primes <- function(x){
  counter <- 0
  for(i in numbers[numbers<=sqrt(x)]){
    counter <- counter + length(numbers[numbers<x/i & numbers>=i])
  }
  return(counter)
}

I’m writing this as a function, taking a limit x. A counter is set to 0. And for each prime i up to \(\sqrt{x}\), I add the number of primes p with i ≤ p < x/i.

I can test it on the example given:

n_semi_primes(30)
## [1] 10

That was the number of semiprimes below 30. And then it is just a question of running it on 10⁸:

answer <- n_semi_primes(10**8)

Project Euler problem 62

The cube, 41063625 (345³), can be permuted to produce two other cubes: 56623104 (384³) and 66430125 (405³). In fact, 41063625 is the smallest cube which has exactly three permutations of its digits which are also cube.

Find the smallest cube for which exactly five permutations of its digits are cube.

Alright. I need to find five cubes that are permutations of the same digits.

How to check if two numbers are permutations of each other?

We can generate the largest permutation of a given number. If the largest permutations of two numbers are identical, the two numbers are permutations of each other.

So I need a function that returns the largest permutation of a number. It would be nice if that function was vectorized.

max_perm <- function(t){
  require(magrittr)
  require(stringr)
  options(scipen=5)
  t %>% 
    as.character() %>% 
    str_split("") %>% 
    lapply(sort, decreasing=TRUE) %>% 
    lapply(paste0, collapse="") %>% 
    unlist() %>% 
    as.double()
}

Convert the input to character. Split at “”. That returns a list with vectors containing the individual digits of the input. lapply sorts the individual vectors in the list in decreasing order. Then lapply pastes the elements in each vector together with paste0 and “” as the separator. Then it is unlisted, and returned as numeric.

What is worth noting is a thing I was struggling with for far too long. R likes to write numbers in scientific notation. As in “1e+06”. I have not studied the phenomenon in detail, but options(scipen=5) solves the problem. It is the “penalty” used to decide when a number should be written in scientific notation. Unless I change that (trial and error, but it should be larger than whatever the default is), as.character(1000000) will return “1e+06”. And the permutations of “1” “e” “+” “0” “6” are not terribly useful in this context.
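A quick illustration (the default penalty is 0):

options(scipen=0)
as.character(1000000)
## [1] "1e+06"
options(scipen=5)
as.character(1000000)
## [1] "1000000"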

I’m hazarding a guess that I don’t need to handle cubes of values of more than four digits.

Beginning with a vector of all numbers from 1 to 9999, I convert it to a dataframe. I transmute the first column to a column with the name x.
Then I mutate a second column, cube, into existence, and calculate it as the cube of the x-value. A third column, max_cube, is mutated with the result from my max_perm function above. And that column is immediately used to group the data, so I get data grouped by identical maximum values of the permutations. I filter on the count of those groups, and only keep the groups that contain 5 elements. Then I ungroup, and select just the cube column.

I now have a data frame with a single column containing 10 values. They are all cubes; five of them are permutations of each other, and the other five are also permutations of each other. And now I just have to take the smallest of them.

library(dplyr)

result <- 1:9999 %>%
  as.double() %>% 
  as.data.frame() %>% 
  transmute(., x = .) %>% 
  mutate(cube = x**3) %>% 
  mutate(max_cube = max_perm(cube)) %>% 
  group_by(max_cube) %>% 
  filter(n()==5) %>% 
  ungroup() %>% 
  select(cube) %>% 
  min()

Before I print the result, so I can enter it into Project Euler, I set options(digits=17).

Done! A good exercise. And a valuable lesson in the importance of the options in R.

Waffle charts

A rather popular chart type. Not really my favorite, but I can see how it makes things easier to understand for people who are not used to reading and understanding charts. The reason for my less than favourable view of waffle charts is probably their overuse in meaningless infographics.

A waffle chart is a grid with squares/cells/icons/whatever, where each cell represents a number of something.

Let’s make an example:

library(ggplot2)
library(waffle)
vec <- c(`Category 1 (10)`= 10 , `Category 2 (20)`= 20,
              `Category 3 (24)`= 24, `Category 4 (16)` = 16)

waffle(vec/2, rows=3, size=0.1, 
       colors=c("#c7d4b6", "#a3aabd", "#a0d0de", "#97b5cf"), 
       title="Four different categories of something", 
       xlab="1 square = 2 somethings")

One annoyance: waffle wants you to spell colours wrong.

waffle takes a named vector of values; rows sets the number of rows of blocks. The default is 10.

One standard way is to show a 10×10 grid, where each cell represents 1% of the total:

waffle(vec/sum(vec)*100)


Bloody annoying – waffle rounds the values of the vector, leading to only 98 squares. So you have to manipulate your vector to get to 100. Well, actually it is probably a minor annoyance.
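One way around it (just a sketch): do the rounding yourself and push any rounding difference onto the largest category, so that the parts are whole numbers summing to exactly 100 before waffle() sees them:

parts <- round(vec/sum(vec)*100)
# dump the rounding difference on the largest category
parts[which.max(parts)] <- parts[which.max(parts)] + (100 - sum(parts))
waffle(parts)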

What if you want something other than coloured squares?

The arguments “use_glyph” and “glyph_size” make that possible.
First, we’ll need the extrafont library:

library(extrafont)

We’ll also need to have the Font Awesome font installed. It can be downloaded from:

http://maxcdn.bootstrapcdn.com/font-awesome/4.3.0/fonts/fontawesome-webfont.ttf

This should be easier if you are on a desktop machine. As I’m running this through my own installation of RStudio on a remote server, it was a bit more difficult.

I needed to place the “fontawesome.ttf” file in the “/usr/share/fonts/truetype/fontawesome” directory.

Then, running R as superuser on the command line, I imported the extrafont library and ran “font_import()”.

But then it worked!

There is now a long list of 593 different icons that can be used. If you want a list, just run fa_list().

And now, we can make a waffle chart with the glyph of our choice.

waffle(vec/2, rows=4, use_glyph = "wifi")
## Warning: Removed 1 rows containing missing values (geom_text).


We can change the colours:

library(RColorBrewer)
waffle(vec/2, rows=4, use_glyph = "wifi", colors=brewer.pal(6,"Set1"))
## Warning: Removed 1 rows containing missing values (geom_text).


Adjust the size

waffle(vec/2, rows=4, use_glyph = "wifi", colors=brewer.pal(6,"Set1"), glyph_size=5)
## Font Awesome by Dave Gandy - http://fontawesome.io
## Warning: Removed 1 rows containing missing values (geom_text).

But that does not look very good.

waffle is based on ggplot, so we have access to the full range of functionality. But not all of it is going to look good in this context:

waffle(vec/2, rows=4, use_glyph = "wifi", colors=brewer.pal(4,"Set1")) +
  geom_label(label="42", size = 3)
## Warning: Removed 1 rows containing missing values (geom_text).


If we install a font that supports it, we even get access to the large number of Unicode glyphs. Here is a favorite of mine:

waffle(vec/2, rows=4, colors=brewer.pal(4,"Set1")) +
  geom_label(label=sprintf("\U1F427"), size = 8)


Which of course requires you to have a font on your computer that supports penguins.

Here is one:
http://users.teilar.gr/~g1951d/Symbola.zip


Crimes against graphs

Crime is a bad thing. No doubt about it. And one of the main topics in today’s debate climate is – “those ’orrible immigrants are very criminal. Look at these numbers, they prove it!”. Usually written with caps-lock engaged.

Well. Maybe they are, and maybe they do. But if you want to use statistics to prove it – pretty please, do not obfuscate the numbers.

This is an example. A blog post from one of the more notable Danish newspapers. In the US it would be regarded as communist; in the rest of the world we would think of it as relatively conservative.

https://kulturkamp.blogs.berlingske.dk/2018/08/17/anmeldte-voldtaegter-og-voldsforbrydelser-er-paa-det-hoejeste-nogensinde/

The claim is that the number of reported rapes and other violent crimes in Denmark is the highest ever, and that this is because of the increasing number of immigrants in Denmark, especially Muslims. Use Google Translate if you want the details.

Again, that claim might be true. But the graphs in the post that supposedly document the claim are misleading. To say the least.

First – the numbers come from the Danish Statistical Bureau. They have a disclaimer telling us that changes to the Danish penal code mean that a number of sexual offenses have been reclassified as violent crimes since 2013. If the number of violent crimes suddenly includes crimes that did not use to be classified as violent crimes, that number will increase. Not much of a surprise. Yes, the post asks why the numbers are still increasing after that reclassification. One should expect them to level off. And again the post may have a valid point. I don’t know. But what I do know is that the graphs are misleading.

Here’s why. The y-axis has been cut off. Let’s recreate the graphs and take a look.

There are two graphs. The first shows the number of reported cases of rape from 1995 until today.

The second shows the total number of reported cases of violent crimes in the same period. Both sets of data come from http://www.statistikbanken.dk/.

We’re going to need some libraries:

library(ggplot2)
library(gridExtra)

Let’s begin by pulling the data.

There might be better ways, but I’ve simply downloaded the data. Two files:

violence <- read.csv("tab1.csv", sep=";", skip=3, header=F)
rape <- read.csv("tab2.csv", sep=";", skip=3, header=F)
violence <- violence[1:(nrow(violence)-7),]
rape <- rape[1:(nrow(rape)-7),]

The last seven lines are the notes about changes in which cases are counted in these statistics. I think that is a pretty important point, but they are difficult to plot.

The graph for rape, as presented in the post, and with a more sensible y-axis:

post <- ggplot(rape, aes(x=V1, y=V2)) +
  geom_line(group=1) +
  scale_x_discrete(breaks = rape$V1[seq(1, length(rape$V1), by = 20)]) +
  theme_classic()

nice <- post + ylim(0,max(rape$V2))
grid.arrange(post, nice, ncol=2)


And the one for violent crimes in general, again with the original on the left, and the better on the right:

post <- ggplot(violence, aes(x=V1, y=V2)) +
  geom_line(group=1) +
  scale_x_discrete(breaks = violence$V1[seq(1, length(violence$V1), by = 20)]) +
  theme_classic()

nice <- post + ylim(0,max(violence$V2))
grid.arrange(post, nice, ncol=2)


So, still, some pretty scary increases. And the change in what is counted should give an increase. But that increase should level off, which it does not. Clearly something is not as it should be. But let’s be honest, the graphs on the right are not quite as scary as the ones on the left.

Also – that change in what is counted as sexual assaults – it can explain the initial increase, but then it should level off. That is a fair point. However, there were other things that changed in the period. #metoo for example. I think it would be reasonable to expect that a lot of cases that used to be brushed off as not very important are now finally being reported. The numbers might actually have leveled off without #metoo.

Anyway, my point is that if you want to use graphs to support your claims, do NOT cut off the y-axis to make them look more convincing.


Migrants visualized

First of all, this is in no way a statement on the immigration crisis in Europe. I do have opinions. But it is more a reaction or reflection on three maps I saw on this page.

The Danish television channel TV2 is illustrating the number of refugees, or perhaps rather immigrants, received in EU member states in the period 2015 to 2017. This is the map showing the number of immigrants to the EU in 2015.

Note Germany. Germany welcomed the absolutely highest number of immigrants. What piqued my interest, though, is that while this might be a good illustration of the numbers, it is not really the relevant comparison. Yes, Germany welcomed more refugees than Denmark did. But Germany is a rather larger country than Denmark. For a given value of “fair”, it is only fair that Germany takes more refugees than smaller countries.

A more relevant comparison might be the number of refugees compared to population. Or area. Sweden saw (at that time) no problems with welcoming a huge number of migrants because, as they said, there is a lot of unpopulated space in Sweden, plenty of room for everyone! Or perhaps GDP is a better way. Richer countries should shoulder a larger part of the challenge than poorer countries.

I’m not concerned here with what is fair. What concerns me is that the graphic is misleading. Let’s make an attempt at fixing that. Or at least present a slightly different perspective on the data.

I’ll try to illustrate the number of migrants as a proportion of the population in the different countries. The data is “stolen” directly from the news channel. They have it from UNHCR, Eurostat and the European Parliament.

The first step will be to get the data.

url <- "http://nyheder.tv2.dk/udland/2018-06-28-se-kortet-saa-mange-asylansoegere-har-de-forskellige-eu-lande-taget"
data <- readLines(url)

By inspection, I can see that the relevant data is in these three lines:

dat.2015 <- data[460]
dat.2016 <- data[409]
dat.2017 <- data[358]

There is a small problem. Some Danish characters are escaped. Let’s fix that:

library(stringr)
data <- str_replace_all(data,"\\\\u00d8", "Ø")
data <- str_replace_all(data,"\\\\u00e6", "æ")
data <- str_replace_all(data,"\\\\u00f8", "ø")

And load it into the separate variables again:

dat.2015 <- data[460]
dat.2016 <- data[409]
dat.2017 <- data[358]

Next, I need to get the names of the countries, and the number of migrants received in each country.

The data I’m after looks like this:

\“Bulgarien\”:{\“valueheat\”:20365,\“valuecolored\”:\“none\”,\“description\”:\“\”},

And the regular expression picking that out of the data ought to be:

‘\“(\p{L}+)\”:{\“valueheat\”:(\d+|\“\”),’

For some reason that is not working. I probably should try to figure that out, but I’m on vacation, and would rather drink cold white wine than dig too deep into the weirdness that is regular expressions in R.

Instead I’m going to use this simpler pattern:

pattern <- '(\")(\\w+)(.*?)(\\d+|\\\"\\\")'

And then fix problems later.

Extracting the data:

dat.2015 <- unlist(str_extract_all(dat.2015, pattern))
dat.2016 <- unlist(str_extract_all(dat.2016, pattern))
dat.2017 <- unlist(str_extract_all(dat.2017, pattern))

Inspecting the data reveals that the interesting parts are elements 23 to 111.

dat.2015 <- dat.2015[23:111]
dat.2016 <- dat.2016[23:111]
dat.2017 <- dat.2017[23:111]

An example of two of the elements:

dat.2015[11:12]
## [1] "\"Danmark\":{\"valueheat\":20935"              
## [2] "\"valuecolored\":\"none\",\"description\":\"\""

First, let’s get rid of the second element. There is one of those for each country.

Secondly, I’ll remove the first two characters of the first element. \“ to be precise.

And thirdly, I’ll split that element on \”:{\“valueheat\”:

Using the nice pipes makes that easy:

library(purrr)
dat.2015 <- dat.2015 %>%
  discard(str_detect, "valuecolored") %>%
  substring(2) %>%
  strsplit(split="\":{\"valueheat\":",fixed=TRUE)

That should give me a list where each element is a vector with two elements:

dat.2015[[8]]
## [1] "Finland" "32345"

Neat! Let’s do that with the other years:

dat.2016 <- dat.2016 %>%
  discard(str_detect, "valuecolored") %>%
  substring(2) %>%
  strsplit(split="\":{\"valueheat\":",fixed=TRUE)

dat.2017 <- dat.2017 %>%
  discard(str_detect, "valuecolored") %>%
  substring(2) %>%
  strsplit(split="\":{\"valueheat\":",fixed=TRUE)

Now, let’s get these data into some data frames. First I’m unlisting the data, then I pour it into a matrix to get the right shape. And then I’m converting the matrices to data frames:

dat.2017 <- data.frame(matrix(unlist(dat.2017), ncol=2, byrow=T), stringsAsFactors = F)
dat.2016 <- data.frame(matrix(unlist(dat.2016), ncol=2, byrow=T), stringsAsFactors = F)
dat.2015 <- data.frame(matrix(unlist(dat.2015), ncol=2, byrow=T), stringsAsFactors = F)

There is a slight problem. Albania was the first country in the list. And the structure of the raw data was a bit different.

dat.2015[1,1]
## [1] "regions\":{\"Albanien"

Let’s fix that:

dat.2015[1,1] <- "Albanien"
dat.2016[1,1] <- "Albanien"
dat.2017[1,1] <- "Albanien"

And let me just change the column names:

colnames(dat.2015) <- c("Land", "2015")
colnames(dat.2016) <- c("Land", "2016")
colnames(dat.2017) <- c("Land", "2017")

“Land” is Danish for “country”.

I’m going to need just one dataframe. I get that by joining the three dataframes:

library(dplyr)
total <- left_join(dat.2015, dat.2016, by="Land")
total <- left_join(total, dat.2017, by="Land")

The numbers are saved as characters. Converting them to numeric:

total$`2015` <- as.numeric(total$`2015`)
## Warning: NAs introduced by coercion
total$`2016` <- as.numeric(total$`2016`)
## Warning: NAs introduced by coercion
total$`2017` <- as.numeric(total$`2017`)
## Warning: NAs introduced by coercion

That introduced some NAs. Countries where there are no data.

Inspecting the data, I can see that there are data for all three years for some countries. For other countries, there are no data at all. The function complete.cases() will return TRUE for a row without NAs.

Using that to get rid of countries where we don’t have complete data:

total <- total[complete.cases(total),]

Next is getting some figures for the populations.

The relevant page on Wikipedia is:

url <- 'https://da.wikipedia.org/wiki/Verdens_landes_befolkningsst%C3%B8rrelser'

Getting that:

library(XML)
library(httr)
r <- GET(url)
doc <- readHTMLTable(
  doc=content(r, "text"))
tabellen <- doc$'NULL'

colnames(tabellen) <-  apply(tabellen[1,],2,as.character)

I’m only interested in the country name, and the population:

tabellen <- tabellen %>%
  select(`Land (eller territorium)`, Population)

Renaming the columns:

colnames(tabellen) <- c("Land", "Population")
tabellen$Population <- as.character(tabellen$Population)
tabellen$Population <- as.numeric(str_remove_all(tabellen$Population, fixed(".")))
## Warning: NAs introduced by coercion

And while I’m at it, the second line gets rid of the factors, and the third removes the thousand separators (“.”)

Now I can join the dataframe containing population figures, with the dataframe containing countries and number of migrants:

total <- left_join(total, tabellen, by="Land")
## Warning: Column `Land` joining character vector and factor, coercing into
## character vector

There are three smaller problems: Cyprus, France and Ireland. The problem is that the country names I get from Wikipedia contain a note. I might be able to get rid of that by code. I’m going to do it manually.

total[10,5] <- 4635400
total[7,5] <- 67286000
total[3,5] <- 847000

Now I have a nice dataframe with the names of the countries (in Danish), the number of migrants received in 2015, 2016 and 2017, and the population in 2018.

Now it is time to look at some maps.

library(ggplot2)
library(rworldmap)
## ### Welcome to rworldmap ###
## For a short introduction type :   vignette('rworldmap')

I am going to match the countries in my dataframe with the countries I get from the map data. That requires that I have the English names for the countries in my dataframe.

This is the translation:

enland <- c("Belgium", "Bulgaria", "Cyprus", "Denmark", "Estonia", "Finland", "France", "Greece", "Netherlands", "Ireland",
          "Italy", "Croatia", "Latvia", "Lithuania", "Luxembourg", "Malta", "Poland", "Portugal", "Romania", "Slovakia",
          "Slovenia", "Spain", "United Kingdom", "Sweden", "Czech Rep.", "Germany", "Hungary", "Austria"  )

Getting that into the data frame:

total$enland <- enland

Next, let’s get the map:

worldmap <- getMap()

That retrieves data for the entire world. I’m only interested in the EU:

EU <- worldmap[which(worldmap$NAME %in% enland),]
EU <- map_data(EU)
## 
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
## 
##     map

The first line extracts the part of the world map that has names in the list of countries that I have data for.
map_data() converts that into a nice structure that is suitable for entering into ggplot.

The next step is calculating the number of migrants received in each country as a proportion of that country’s population:

total <- total %>%
  mutate(`2015` = `2015`/Population*100, `2016` = `2016`/Population*100, `2017`=`2017`/Population*100)

I’m mutating the columns 2015-2017 by dividing by population. And multiplying by 100 to get percentages.

The almost final step is to join my migrant proportions with the map data:

total <- left_join(total,EU, by=c("enland"="region") )

The map data does not call the countries “countries”. Rather, their names are saved in the variable “region”.

And now the final step. I’m going to need the data in tidy form. So I’m loading tidyr.

Then I pass the data frame to select(), where I pick out the variables I need: long(itude), lat(itude), 2015, 2016 and 2017, and the name of the country.
That is passed to gather(), where I make a new row for each year, with the proportions in the new variables year and prop.

All that is passed to ggplot, and a layer where the polygons describing the countries are plotted. They are filled with a colour matching the proportions, and grouped by “group”. This is important. Grouping by country name gives weird results. I’ll get back to that. color = "white" draws the outlines of the polygons in white.

Finally, I facet the data on year.

library(tidyr)
total %>%
  select(long,lat,`2015`,`2016`,`2017`, group) %>%
  gather(year, prop,  `2015`:`2017`)%>%
  ggplot() +
    geom_polygon(aes(long, lat, fill = prop, group = group), color = "white") +
    theme_void() +
    facet_wrap(~year, ncol=2)


That’s it!
And now the picture is slightly different. What is interesting is that Germany still takes a higher proportion of the migrants than other countries. But in 2015, they didn’t. That was the year when the German chancellor Angela Merkel said the famous words “Wir schaffen das”, we’ll manage. But it was also the year when Hungary and Sweden welcomed migrants in numbers equalling 1.79% and 1.65% of their populations respectively. You can compare that with the fact that Germany the same year received migrants equalling 0.58% of its population.

A cynic might claim that it is no surprise that Sweden and Hungary closed their borders late in 2015.

Anyway, that is a different subject. I just think that these three maps are slightly more informative than what TV2 provided.

Also, I promised to get back to the group thingy.

Making the same plot, but grouping on country names:

total %>%
  select(long,lat,`2015`,`2016`,`2017`, group, enland) %>%
  gather(year, prop,  `2015`:`2017`)%>%
  ggplot() +
    geom_polygon(aes(long, lat, fill = prop, group = enland), color = "white") +
    theme_void()


What happens is that the polygons describing Italy are grouped in a way that connects the parts describing Sicily to the northern part of Italy. That looks weird. The same happens with Sardinia.

Finally, I have not been very consistent in my use of words. I have used “received” and “welcomed” interchangeably. Hungary and Denmark have not been very welcoming. But we are talking about real humans here, and welcoming simply sounds nicer than received. Complicating the situation was the fact that a lot of the arrivals were not actually what we would normally call refugees. At least not refugees from war. So I have also not been consistent in the use of “migrant” vs “refugee”. That is not really my point. The point is that we should always think about how these kinds of numbers are presented.


Project Euler 203

Squarefree Binomial Coefficients. Build Pascal’s triangle to 51 rows (look it up – it’s a bloody nightmare to write out here), and locate all the distinct numbers in that triangle that are not divisible by the square of a prime.

OK. The first step is to get all the numbers in 51 rows of Pascal’s triangle.

R has a built-in function, choose(i,j), that returns the number in row i, position j.
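For example, row number 4 (the rows are counted from 0):

choose(4, 0:4)
## [1] 1 4 6 4 1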

We can use that to iterate through all the possible positions in a 51-row triangle:

numbers <- 1
for(i in 0:50){
  for(j in 0:i){
    numbers <- c(numbers, choose(i,j))
  }
}

Next step is to make sure that we only have unique, distinct, values:

numbers <- unique(numbers)

We now need to check whether each number is divisible by the square of a prime. My first instinct was to generate all primes smaller than the square root of the largest number in numbers.

That would be all primes lower than 11,243,247.

I would then square all those primes, and see if one of them divided neatly into the numbers in the triangle.

That’s an awful lot of primes.

Much easier is to note that if the square of a prime divides neatly into a number, then the prime does as well – and is in fact a prime factor of that number.

And since we have a nice library that makes it easy to get the prime factors, that’s the way to do it.

library(numbers)
library(purrr)

res <- numbers %>%
    discard(function(x) any(x%%(factors(x)**2)==0))

answer <- sum(res)+1

I pass the numbers vector to the discard function. An element is discarded if the element, modulo the squares of the prime factors of that element, gives one (or more) results equal to 0.

The answer is the sum of the elements that are left. Plus 1, since 1 is discarded by the function. factors(1) – for some reason – returns 1.
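A quick check of that condition on two small numbers (assuming, as above, that factors() returns the prime factorisation with multiplicity):

# 12 is divisible by 2^2, so it would be discarded
any(12 %% (factors(12)**2) == 0)
## [1] TRUE
# 10 is squarefree, so it is kept
any(10 %% (factors(10)**2) == 0)
## [1] FALSE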

Nice and simple really.

Project Euler 263 – An engineers’ dream come true

This is, at the time of coding, the highest-numbered Project Euler problem I’ve tried to tackle. With a difficulty rating of 75% it is also the most difficult. At least on paper. But An engineers’ dream come true? How can I, as an engineer, not try to solve it?

We are looking for an n for which it holds that:

  • n-9 and n-3 must be consecutive primes
  • n-3 and n+3 must also be consecutive primes
  • n+3 and n+9 must also be consecutive primes.

These are prime pairs that are “sexy”, that is, pairs that have a difference of 6.

Also, n, n-8, n-4, n+4 and n+8 must be practical numbers, that is, numbers m where every number from 1 to m can be written as a sum of distinct divisors of m. For example, 12 is practical: its divisors are 1, 2, 3, 4, 6 and 12, and 5 = 1 + 4, 7 = 1 + 2 + 4, 11 = 1 + 4 + 6, and so on.

So if a number n gives sexy prime pairs, and n and its neighbours are practical – that is what an engineer dreams of; hence the paradise.

The fastest way seems to be to generate a list of primes, sieve out those that conform to the requirements for consecutive primes, and then test those numbers for practicality.

Let’s get started!

The trusty numbers library provides the primes up to 1,000,000. Then, for each of those primes, return the n-value if the differences between the consecutive primes are 6.

library(numbers)
library(purrr)

primenumbers <- Primes(1000000)
primecandidates <- numeric()
for(i in 1:(length(primenumbers)-4)){
if((primenumbers[i+1] - primenumbers[i] == 6) & (primenumbers[i+2] - primenumbers[i+1]== 6) & (primenumbers[i+3] - primenumbers[i+2]== 6)){
  primecandidates <- c(primecandidates,(primenumbers[i]+9))
  }
}

What we get from this is not prime numbers, but values of n that give the consecutive primes we need.

Now I have a list of candidate n’s based on the requirements for primes. Next I need to check for practicality.

First I tried a naive way. Generate the list of numbers that I need to sum to using distinct divisors, for a given x.
Then get the divisors of x. I don’t need to check if I can sum to the numbers that are themselves divisors, so excluding them leaves me with a slightly smaller set. Then I get all the ways I can take two divisors from the set of divisors, sum them, and exclude those sums from the list of numbers. I continue until I have taken all the possible combinations of 2, 3, 4, etc. divisors, based on how many there are. If there are no numbers left in the vector of numbers that I need to be able to sum to, I was able to express all those numbers as sums of distinct divisors. And then x was a practical number.

practical <- function(x){
  test <- 1:x
  divs <- divisors(x)
  test <- setdiff(test,divs)
  for(i in 2:length(divs)){
    test <- setdiff(test,combn(divs,i,sum))
  }
  !as.logical(length(test))
}

Two problems. One can be fixed: I continue to generate combinations of divisors and sum them, even if I have already found ways to sum all the numbers. The second problem is more serious. When I have to test a number with really many divisors, it takes a long time. Also, handling a vector containing all numbers 1:1000000 takes a lot of time.

I need a faster way of checking for practicality.

Wikipedia to the rescue. There exists a way of checking. I have no idea why it works. But it does.

For a number x larger than 1, where the first prime factor is 2: order all the prime factors. Each prime factor has to be smaller than or equal to the sum of the divisors of the product of all the smaller prime factors, plus one. Oh! And that product – if 3 is a prime factor twice, that is, if 3² is a factor, I should square 3 in the product.

That sounds simple.

For a number x, get the prime factors. Use table to get the counts of the prime factors, i.e. that 3 is there twice. Those counts are the values of the result from the table function. The names of the table are the prime factors.

For each factor from number 2 to the last of the factors, get the names of the prime factors from number 1 to just before the factor we are looking at (as numeric). Exponentiate with the values from the table – that is, how many times a prime factor occurs. Take the product, get the divisors of that product, sum them, and add 1. If the factor we were looking at is larger than that, we have determined that x is not practical – and can return FALSE. If x gets through that process, it is practical.

I need to handle the case where there is only one prime factor – 2. Those numbers are practical, but the way I have done the check breaks when there is only one prime factor. It’s simple enough: just check if there is only one distinct prime factor, and return TRUE in that case.

practical <- function(x){
  # counts of each distinct prime factor; the names are the factors themselves
  all_factors <- table(factors(x))
  n_factors <- length(all_factors)
  # only one distinct prime factor (here that will be 2): practical
  if(n_factors == 1){
    return(TRUE)
  }
  for(i in 2:n_factors){
    smaller <- as.numeric(names(all_factors)[1:(i-1)])**as.numeric(all_factors)[1:(i-1)]
    if(as.numeric(names(all_factors)[i]) > sum(divisors(prod(smaller))) + 1){
      return(FALSE)
    }
  }
  return(TRUE)
}
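A quick sanity check on two small numbers – 12 is a practical number, 10 is not (4 cannot be written as a sum of distinct divisors of 10):

practical(12)
## [1] TRUE
practical(10)
## [1] FALSE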

So, quite a bit faster!

Now I can take my candidate n’s based on the requirements for prime pairs, and just keep the n’s that are themselves practical, and where n-8, n-4, n+4 and n+8 are also practical:

eng_numbers <- primecandidates %>%
  keep(function(x) practical(x-8)) %>%
  keep(function(x) practical(x-4)) %>%
  keep(function(x) practical(x)) %>%
  keep(function(x) practical(x+4)) %>%
  keep(function(x) practical(x+8))

eng_numbers
## numeric(0)

OK. There were no n’s in this set.

This is kind of a problem. The n’s we are looking for are actually pretty large. I know this because this writeup was written after I found the solution. So it is not because the code is flawed.

Now we have the challenge that we need to get hold of rather large numbers.

primenumbers <- Primes(500000000)
primecandidates <- numeric()
for(i in 1:(length(primenumbers)-4)){
if((primenumbers[i+1] - primenumbers[i] == 6) & (primenumbers[i+2] - primenumbers[i+1]== 6) & (primenumbers[i+3] - primenumbers[i+2]== 6)){
  primecandidates <- c(primecandidates,(primenumbers[i]+9))
  }
}
eng_numbers <- primecandidates %>%
  keep(function(x) practical(x-8)) %>%
  keep(function(x) practical(x-4)) %>%
  keep(function(x) practical(x))   %>%
  keep(function(x) practical(x+4)) %>%
  keep(function(x) practical(x+8))

length(eng_numbers)
## [1] 3

That gives us three. We need four.

Now – I could just test larger and larger sets of primes, but I run into memory problems when I try that. Or I could write some code that generates prime numbers between two limits, and increase those limits until I get number four.

I have not. Instead I just tried again and again with larger limits, until I found number four. The upper limit for the primes needs to be some four times larger than the one I just used. Anyway, I found number four. And got the green tickmark.

Lesson learned

Sometimes you really need to look into the math. This would have been impossible if I had not found a quicker way to check for practicality.

Also, I should refactor this code. The check for practicality could certainly be optimised a bit.

And – I should probably check for primality, rather than generating the list of primes.
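Something like this sketch could do that, assuming the isPrime() function from the numbers package – for a candidate n, the four numbers must be prime, and no other number between n-9 and n+9 may be prime (otherwise the pairs are not consecutive):

# a sketch, not the code I actually ran
is_dream_pattern <- function(n){
  four <- c(n - 9, n - 3, n + 3, n + 9)
  if(!all(sapply(four, isPrime))) return(FALSE)
  # no other primes are allowed between n-9 and n+9
  between <- setdiff((n - 8):(n + 8), four)
  !any(sapply(between, isPrime))
}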

Stacked barcharts with ggplot2

Barcharts to percentages

Maybe not the most correct headline.

Anyway. Assume we have five cases, each with two values for a given treatment, labelled “AVO” and “INT”. How do we get a barchart where each case is one bar, and the bars are coloured based on the percentages of the two values?

Let’s make some sample data:

set.seed(47)
cases <- 1:5
cases <- c(cases, cases)
treat1 <- rep("AVO", 5)
treat2 <- rep("INT", 5)
treat <- c(treat1, treat2)
val <- runif(10)

df <- data.frame(cases=cases, treatment = treat, value = val, stringsAsFactors = FALSE)

How does that look?

df
##    cases treatment     value
## 1      1       AVO 0.9769620
## 2      2       AVO 0.3739160
## 3      3       AVO 0.7615020
## 4      4       AVO 0.8224916
## 5      5       AVO 0.5735444
## 6      1       INT 0.6914124
## 7      2       INT 0.3890619
## 8      3       INT 0.4689460
## 9      4       INT 0.5433097
## 10     5       INT 0.9248920

OK, nice and tidy data.

What I want is a bar for case 1, filled with percentages. One color with 0.976/(0.976+0.691) and another color for 0.691/(0.976+0.691).
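Those shares are exactly what position = “fill” will compute for us behind the scenes. Just to make the numbers explicit (a small aside, assuming dplyr is available):

library(dplyr)
df %>%
  group_by(cases) %>%
  mutate(share = value / sum(value)) %>%
  ungroup()

And the chart itself: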

library(ggplot2)
ggplot(df)+
  geom_col(aes(y=value, x=cases, fill=treatment), position = "fill")


We’re using geom_col. That is shorthand for geom_bar with stat=“identity”, and is the shortcut to use when the bar heights should represent values in the data.

The y-value is the value from df, the x-value is the cases. And we’re colouring the bars based on the value in treatment. position=“fill” stacks the bars and normalizes the heights. If we wanted to just stack the elements, we would use position=“stack”:

ggplot(df)+
  geom_col(aes(y=value, x=cases, fill=treatment), position = "stack")


This is a great, like really great, cheatsheet for ggplot:

https://www.rstudio.com/wp-content/uploads/2015/03/ggplot2-cheatsheet.pdf