Project Euler 54

Poker hands. Project Euler number 54

Given 1000 random hands of 5 playing cards dealt to two players, how many hands does player 1 win?

We have the following notation for the cards:

2, 3, 4, 5, 6, 7, 8, 9, T, J, Q, K, A

Where T is ten, J is Jack, Q is Queen, K is King and A is Ace.

And the suits:

H, C, S, D

We have a number of different combinations:

  1. High Card: Highest value card.
  2. One Pair: Two cards of the same value.
  3. Two Pairs: Two different pairs.
  4. Three of a Kind: Three cards of the same value.
  5. Straight: All cards are consecutive values.
  6. Flush: All cards of the same suit.
  7. Full House: Three of a kind and a pair.
  8. Four of a Kind: Four cards of the same value.
  9. Straight Flush: All cards are consecutive values of same suit.
  10. Royal Flush: Ten, Jack, Queen, King, Ace, in same suit.

If a given hand is a Flush, it gets the rank of 6. Which means it beat a Straight, which has the rank of 5.

That should be simple. Write a function that returns the rank of a given hand, and determine if one hand beats another hand.

However, two hands with the rank 2, are not necessarily equal. One pair of fours is better than one pair of twos.

So the function should also return some value that can be used for discriminating between different hands with the same rank.

A hand comes on the form:

5H 5C 6S 7S KD

I think it might be useful to have a function that returns the suits and the values.

library(dplyr)
suits <- function(s){
  delres <- unlist(strsplit(s, split = " "))
  sapply(delres, function(x) substr(x,2,2), USE.NAMES = FALSE)
}

values <- function(s){
  delres <- unlist(strsplit(s, split = " "))
  sapply(delres, function(x) substr(x,1,1), USE.NAMES = FALSE)
}

Testing for a Royal Flush. All values of suits should be identical. And the values should be T, J, Q, K, A. That is simple. Check if all the values that should be in the hand, if it is a Royal Flush are there. Then check if the suit is the same for all cards in the hand. If both are true, return TRUE.

is_royal_flush <- function(testhand){
  testvalue <- c("T", "J", "Q", "K", "A")
  value_match <- all(testvalue %in% values(testhand))
  suits_match <- length(unique(suits(testhand)))==1
  return(as.logical(value_match*suits_match))
}

Straight Flush: All cards are consecutive values of same suit.

Testing for the same suit is copy-paste from above. Testing for consecutive values is a little bit more tricky. I cant just sort the values, since K is before Q in the alphabet. So I’m using dplyr::recode() to recode the T, J, Q, K, A values to numbers. The I can convert them to numeric and sort them. And I’ll just have to figure out if the values are consecutive.

I had to google a bit. But diff() does it. It calculates the difference between the elements. It is possible to define a lag, but here the default of 1 does the trick. It returns the second element in the vector, minus the first element. And the third element minus the second.

When I have five elements, and they have consecutive values, diff() will return a vector of 4 ones.

is_straight_flush <- function(testhand){
  suits_match <- length(unique(suits(testhand)))==1
  testvalues <- values(testhand)
  testvalues <- recode(testvalues, T = "10", J = "11", Q = "12", K = "13", A = "14")
  testvalues <- as.numeric(testvalues)
  testvalues <- sort(testvalues)
  value_match <- all(diff(testvalues) == c(1,1,1,1))
  return(as.logical(value_match*suits_match))
}

Four of a Kind: Four cards of the same value.

We are not interested in the suits here. First I get the values. Then I use table() to get the counts of the different values. Passing that to max() returns 4, if there are four identical values.

is_four_of_a_kind <- function(testhand){
  testvalues <- values(testhand)
  testvalues <- table(testvalues)
  return(max(testvalues)==4)
}

Full House: Three of a kind and a pair. I use the same functionality as in is_four_of_a_kind. The trick here is, that the max of the tabulation of values should be 3. But the minimum should also be 2.

is_full_house <- function(testhand){
  testvalues <- values(testhand)
  testvalues <- table(testvalues)
  (max(testvalues)==3)&(min(testvalues)==2)
}

Flush: All cards of the same suit. In is_straight_flush I test if the suits are the same. That is what I need here, I just skip the part where I test if the values are consecutive.

is_flush <- function(testhand){
  suits_match <- length(unique(suits(testhand)))==1
  return(as.logical(suits_match))
}

Straight: All cards are consecutive values. This is more or less the same function is_straight_flush(). I just skip the test for suits.

is_straight <- function(testhand){
  testvalues <- values(testhand)
  testvalues <- recode(testvalues, T = "10", J = "11", Q = "12", K = "13", A = "14")
  testvalues <- as.numeric(testvalues)
  testvalues <- sort(testvalues)
  value_match <- all(diff(testvalues) == c(1,1,1,1))
  return(as.logical(value_match))
}

Three of a Kind: Three cards of the same value. That is basically the same as the function is_four_of_a_kind() above. The maximum value from the table should just be 3 instead of 4:

is_three_of_a_kind <- function(testhand){
  testvalues <- values(testhand)
  testvalues <- table(testvalues)
  return(max(testvalues)==3)
}

Two Pairs: Two different pairs. First I use values() to get the values of the hand. I’m not interested in the suits.
Then I tabulate the testvalues. I get a count of the number each value occurs in the hand.
But I’m no interested in that number. If there are two pairs in the hand, I will get two values, that occurs two times.

So if I run table() on the result again, I get a count of how many times the count 2 occurs. If the count 2 occurs twice, there are two hands.

is_two_pairs <- function(testhand){
  testvalues <- values(testhand)
  testvalues <- table(testvalues)
  testvalues <- table(testvalues)
  2 %in% testvalues
}

One Pair: Two cards of the same value. Again, basically the same as is_three_of_a_kind(), with 2 instead of 3 in the test:

is_pair <- function(testhand){
  testvalues <- values(testhand)
  testvalues <- table(testvalues)
  return(max(testvalues)==2)
}

It will return FALSE if there are three of a kind. I’m interested in the highest rank a hand can get, so that should not be a problem.

High Card: Highest value card. I don’t need to test for this. If a hand has not gotten a higher rank, it will always have this.

Nice. I now have a set of functions that I can use to figure out what hand has the highest rank.

Lets put that together:

rank <- function(testhand){
  res <- 1 # The hand will never get a rank lower than 1.
  if(is_pair(testhand)){res <- 2}
  if(is_two_pairs(testhand)){res <- 3}
  if(is_three_of_a_kind(testhand)){res <- 4}
  if(is_straight(testhand)){res <- 5}
  if(is_flush(testhand)){res <- 6}
  if(is_full_house(testhand)){res <- 7}
  if(is_four_of_a_kind(testhand)){res <- 8}
  if(is_straight_flush(testhand)){res <- 9}
  if(is_royal_flush(testhand)){res <- 10}
  return(res)
}

That should solve a lot of the cases. However. What if the rank of two hands are the same?

If both hands have rank 9, that is if they are both straight flush, the hand with the highest card wins. There is no reason to test for the rest of the cards.

For all the others, I need functions that breaks the tie. It it should be broken.

Most of those determinations will be based on the value of the cards. So I might as well have a function that recodes ace to 14 etc.

recode_value <- function(values){
  as.numeric(recode(values, T = "10", J = "11", Q = "12", K = "13", A = "14"))
}

I’m probably going to need the reverse function:

rev_recode_value <- function(values){
  recode(as.character(values), "10" = "T", "11" = "J", "12" = "Q", "13" = "K", "14" = "A")
}

For High Card, the hand with the highest value card wins. If the highest value card in both hand 1 and hand 2 is equal, we look at the second highest card in each hand. Etc. Only if the values are identical in the two hands is there a tie.

break_high_card <- function(hand1, hand2){
  v1 <- values(hand1)
  v2 <- values(hand2)
  v1 <- sort(recode_value(v1))
  v2 <- sort(recode_value(v2))
  if((identical(v1,v2))){
    return(0)
    break()
  } else{
  while(v1[length(v1)]==v2[length(v2)]){
    v1 <- v1[-length(v1)]
    v2 <- v2[-length(v2)]
  }  
  }
  if(max(v1)>max(v2)){
    return(1)
  } else {
    return(2)
  }
}

All right. That was not easy… First I get the values for the two hands. Then I recode them to numerical values, and sort them.

If the two hands have equal values, it is a tie. Otherwise, while the last element in the first vector containing the values is equal to the last element in the second vector, I get rid of the last element in both vectors.

When that is done, the last element in the two vectors is compared. The hand with the highest value wins.
The funcion returns 0 if it is a tie, 1 if hand1 wins, and 2 if hand2 wins.

This was actually the hardest part to write.

Next. Breaking the tie between two hands each having one pair.

break_one_pair <- function(hand1, hand2){
  v1 <- values(hand1)  
  v2 <- values(hand2)
  v1 <- recode_value(v1)
  v2 <- recode_value(v2)
  v1 <- table(v1)
  v2 <- table(v2)
  v1_pair <- as.numeric(names(v1)[v1==2])
  v2_pair <- as.numeric(names(v2)[v2==2])
  if(v1_pair>v2_pair){
    return(1)
    break()
  }
  if(v2_pair>v1_pair){
    return(2)
    break()
  }
  if(v2_pair==v1_pair){
    v1 <- names(v1)[!v1==2]
    v2 <- names(v2)[!v2==2]
    v1 <- rev_recode_value(v1)
    v2 <- rev_recode_value(v2)
    return(break_high_card(v1,v2))
    break()
  }
}

First I’m picking out the values of the pairs in the two hands. If hand1 has a pair of 8’s and hand2 has a pair of 7’s, hand1 wins.

If the pairs are of the same value. Then I pick out the values of the rest of the cards. I am making a dangerous assumption here. This will only work if the two hands actually only have one pair. I hope it works, but this is one of the places where it could go wrong.

Two Pairs: Two different pairs.

The hand with the highest value pair wins. If those values are identical, the value of the second pair determines the winner. If that is also equal, the value of the fifth remaining card, determines the winner.
Only if that is also equal, is there a tie. Therefore, there can only be a tie, if all values of all cards are equal.

break_two_pairs <- function(hand1, hand2){
  v1 <- values(hand1)  
  v2 <- values(hand2)
  v1 <- recode_value(v1)
  v2 <- recode_value(v2)
  v1 <- table(v1)
  v2 <- table(v2)
  v1_pair <- sort(as.numeric(names(v1)[v1==2]))
  v2_pair <- sort(as.numeric(names(v2)[v2==2])) # now v1_pair and v2_pair contains the values of each of the pairs in each of the hands.
  if(identical(v1_pair,v2_pair)){        # If the pairs have the same value, pick out the fifth value, and compare them.
    v1 <- as.numeric(names(v1)[!v1==2])
    v2 <- as.numeric(names(v2)[!v2==2])
    if(v1>v2){
      return(1)
      break()
    }
    if(v2>v1){
      return(2)
      break()
    }
    if(v1==v2){
      return(0)
      break()
    }
  } else {
    v1 <- rev_recode_value(v1_pair)
    v2 <- rev_recode_value(v2_pair)
    return(break_high_card(v1,v2))
  }
}

Kinda the same as for one pair. v1_pair and v2_pair ends up with vectors containing the values of the two pairs in the two hands.

If these two vectors are identical, we need to look at the fifth value.

If not, I reverse_recode the values of the pairs, and use break_high_card, to determine which is larger.

Three of a Kind: Three cards of the same value.

Again I’m making the assumption that the two hands passed to this function have three of a kind. And nothing better than that.

This is a bit simpler. There is no way that both hands can have three of a kind with the same value. So I don’t need to consider the remaining cards. The assumption here is that there is only on deck of cards involved. This is another place where things can go wrong. But lets try!

break_three_of_a_kind <- function(hand1, hand2){
  v1 <- values(hand1)
  v2 <- values(hand2)
  v1 <- recode_value(v1)
  v2 <- recode_value(v2)
  v1 <- table(v1)
  v2 <- table(v2)
  v1 <- as.numeric(names(v1)[v1==3])
  v2 <- as.numeric(names(v2)[v2==3])
  if(v1>v2){
    return(1)
    break()
  } else {
    return(2)
    break()
  }
}

I convert to values, recode them, tabulates them. And pick out the values that occur three times. Then I just compare them.

Straight: All cards are consecutive values.

Once more I assume that the two hands passed to this function actually only contains a straigth. It is rather simple. As the values must be consecutive, the highest value in each hand determines the winner.

break_straigth <- function(hand1,hand2){
  v1 <- values(hand1)
  v2 <- values(hand2)
  v1 <- recode_value(v1)
  v2 <- recode_value(v2)
  v1 <- max(v1)
  v2 <- max(v2)
  if(v1>v2){
    return(1)
    break()
  } 
  if(v2>v1){
    return(2)
    break()
  }
  if(v2==v1){
    return(0)
    break()
  }
}

Flush: All cards of the same suit.

Again – I assume that flush is the highest combination in these hands. This is basically the same as high card.

break_flush <- function(hand1, hand2){
  return(break_high_card(hand1,hand2))
}

Full House: Three of a kind and a pair.

The same assumption applies. These two hands both have Full House as the highest rank.

The winner is determined by the higher value of the three of a kind.

break_full_house <- function(hand1,hand2){
  return(break_three_of_a_kind(hand1,hand2))
}

Four of a Kind: Four cards of the same value.

Same as for the full house. The winner is determined by the higher value of the four of a kind.

break_four_of_a_kind <- function(hand1,hand2){
  v1 <- values(hand1)  
  v2 <- values(hand2)
  v1 <- recode_value(v1)
  v2 <- recode_value(v2)
  v1 <- table(v1)
  v2 <- table(v2)
  v1 <- sort(as.numeric(names(v1)[v1==4]))
  v2 <- sort(as.numeric(names(v2)[v2==4]))
  if(v1>v2){
    return(1)
    break()
  } else {
    return(2)
    break()
  }
}

Straight Flush: All cards are consecutive values of same suit.

Again I’m making dangerous assumptions. But, this is functionally the same as high card.

break_straight_flush <- function(hand1,hand2){
  return(break_high_card(hand1,hand2))
}

Royal Flush: Ten, Jack, Queen, King, Ace, in same suit.

If both hand are royal flush – it will always be a tie.

break_royal_flush <- function(hand1,hand2){
  return(0)
}

Phew…

I now have a function that will return the rank of a given hand.

If I compare the rank of two hands, I will either get a clear winner, or need to run one of the break functions to determine the result.

winner <- function(hand1, hand2){
  if(rank(hand1)>rank(hand2)){
    return(1)
    break()
  }
  if(rank(hand2)>rank(hand1)){
    return(2)
    break()
  }
  if(rank(hand2)==rank(hand1)){
    rang <- rank(hand2)
    if(rang==1){
      return(break_high_card(hand1,hand2))
      break()
    }
    if(rang==2){
      return(break_one_pair(hand1,hand2))
      break()
    }
    if(rang==3){
      return(break_two_pairs(hand1,hand2))
      break()
    }
    if(rang==4){
      return(break_three_of_a_kind(hand1,hand2))
      break()
    }
    if(rang==5){
      return(break_straigth(hand1,hand2))
      break()
    }
    if(rang==6){
      return(break_flush(hand1,hand2))
      break()
    }
    if(rang==7){
      return(break_full_house(hand1,hand2))
      break()
    }
    if(rang==8){
      return(break_four_of_a_kind(hand1,hand2))
      break()
    }
    if(rang==9){
      return(break_straight_flush(hand1,hand2))
      break()
    }
    if(rang==10){
      return(break_royal_flush(hand1,hand2))
      break()
    }
  }
}

Phew… Again.

What should be left now, is to load the hands, and test them.

data <- read.csv("https://projecteuler.net/project/resources/p054_poker.txt", header=FALSE, stringsAsFactors = FALSE)
library(tidyr)
library(stringr)

answer <- data %>%
  mutate("h1" = str_sub(V1, 1,14)) %>%
  mutate("h2" = str_sub(V1, 16,30)) %>%
  rowwise() %>%
  mutate("w" = winner(h1,h2)) %>%
  filter(w==1) %>%
  nrow()

I read in the file to data, make a column h1 containing the first 14 characters of each line in the datafile, and a column h2 containing the rest of the line.

rowwise() groups the dataframe by row – so it doesn’t matter that my winner() function is not that vectorized. For each set of hands I calculate the winner, filter the result so I only have the rows where hand 1 wins. And then I get the number of rows – which are the answer.

Lessons learned

Some. rowwise() is a usefull function. I’m certain I dont understand everything about it, but it would seem to solve a fair number of the problems I’ve had using non-vectorized functions.

I should probably also learn to set aside some time for re-factoring my code before publishing it.

Project Euler 37

Project Euler 37

3797 is a prime. What is interesting about this prime, is that if we truncate it one digit at a time from left, ie:
3797, 797, 97, 7 every step is also prime. And if we do the same from the right, we see the same property:
3797, 379, 37, 3.

There are 11 primes with this property. What is the sum of them?

2, 3, 5 and 7 are not considered to be truncatable primes.

And that gives us a hint. The first digit of a truncatble prime must be either 2, 3, 5 or 7. And the last digit must be 3, 5 or 7.

Lets begin by loading the package numbers:

library(numbers)

That gives us couple of useful functions. isPrime(x) returns true if x is prime. And Primes(x,y) returns all primes between x and y.

I am going to need a couple of functions. One that repeatedly truncates a number from the left, and returns true if all the steps in the sequence are prime. And another that does the same, just from the rigth.

truncleft <- function(x){
  res <- TRUE
  while(x>9){
    x <- x %/% 10
    res <- isPrime(x)*res
  }
  return(as.logical(res))
}

truncright <- function(x){
  res <- TRUE
  while(x>9){
    x <- x - (x %/% 10**floor(log10(x)))*(10**floor(log10(x)))
    res <- isPrime(x)*res
  }
  return(as.logical(res))
}

Now I can determine if a given number is truncatable prime from both left and right. Next, finding all 11 primes with that property.

Lets take a chance, and see if not number 11 is found before we reach 1000000

library(purrr)
answer <- Primes(9,1000000) %>%
  keep(truncleft) %>%
  keep(truncright) %>%
  sum()

Yep. The difficult part is finding number 11. The first four primes with this property have two digits – you can find them manually. The next four are a bit more difficult. Number 10 is the example Project Euler provides.

Lessons learned

Well – purrr is pretty useful. But I already knew that.

Project Euler – 28

Number spiral diagonals

We can make squares of numbers, by starting with 1, and add numbers in a spiral.

It is easier to show than descripe:

square <- matrix(c(21, 22, 23, 24, 25,20,  7,  8,  9, 10,19,  6,  1,  2, 11,18,  5,  4,  3, 12,17, 16, 15, 14, 13), 5,5,byrow=T)
square
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   21   22   23   24   25
## [2,]   20    7    8    9   10
## [3,]   19    6    1    2   11
## [4,]   18    5    4    3   12
## [5,]   17   16   15   14   13

This is a 5 by 5 square. Start at 1, add the 2 to the right of that, and add numbers in a clockwise spiral.

If we look at all the numbers in the diagonals, we can see that the sum of those numbers, is 101. Ie, 21 + 7 + 1 + 3 + 13 + 25 + 9 + 5 + 17 = 101

We are now asked: What is the sum of the numbers in the diagonals in a 1001 by 1001 square?

Looking at an n by n square, where n is uneven, we can see that the north-eastern corner of the square will always be n2.

We know that the side is n. So the value of the north-western corner will always be n2 – (n-1).

Similar the south-western corner is n2 – 2(n-1). And the south-eastern is n2 – 3(n-1).

The first “square” is a special case. But I can now calculate the corners in all the squares with unequal n.

The sum of those corners is:

n2 + n2 -(n-1) + n2 – 2(n-1) + n2 – 3(n-1)

Which reduces to:

4*n2 – 6n + 6

If I calculate that for all unequal n from 3 to 1001, sum them, and add 1, I have my answer.

library(purrr)
answer <- seq(3,1001,by=2) %>%
  map(function(x) 4*x*x - 6*x + 6) %>%
  unlist() %>%
  sum() 
answer <- answer + 1

Lessons learned

Not really.

Sudokus – Project Euler 96

This problem is basically about making a sudoku solver, that can be automated. If you don’t know what a sudoku is – google it and come back.

I have solved a LOT of sudokus. Manually. There are several techniques, but some of them are not that easy to code. And of course I’ll need a method that guarantees a solution.

Fortunately a brute force solution exists:

https://en.wikipedia.org/wiki/Sudoku_solving_algorithms

It is called backtracking. The process is to look at all the empty cells.

In the first one – insert the lowest allowed value, and progress to the next empty cell. Insert the lowest allowed value in it. Continue until you reach a cell with no allowed values. When you do that, go back to the previous cell, and insert the next-lowest allowed value. And then go back (forward?) to the next cell. If the cell you went back to also have no allowed values, you go one step futher back. Continue until you reach the last cell in the puzzle.

So. We’ll need a few things.

First of all we’ll need a matrix in which to keep the sudoku.

I’m going with a matrix rather than a dataframe after reading the R Inferno (https://www.burns-stat.com/pages/Tutor/R_inferno.pdf), that suggests that there is no reason to work with a dataframe if a matrix will do.

I’ll need a couple, or three (actually more like five) functions to figure out what values are allowd for a given cell in a given sudoku.

And I’ll need a way to keep track of which cells that are “frozen”, and which cells should be filled out.

Lets begin by making a matrix to keep the puzzle in. Project Euler provides us with a puzzle and the matching solution. And a file with 50 puzzles that needs to be solved.

Lets read in that file.

problems <- read.csv("p096_sudoku.txt",header=FALSE, stringsAsFactors = FALSE)

The structure is rather simple. 500 rows. First one row with an identifier, eg “Grid 01”. And then 9 rows with just the rows of digits, 0 for the empty cells.

We’ll need to pick out first the rows 1 to 10, then rows 11 to 20 etc.

Lets make a dataframe, and read it in:

start <- seq(1,500,by=10)
end <- seq(10,500,by=10)
prob_df <- data.frame(id=character(),problem=character(), stringsAsFactors = FALSE)
for(i in 1:50){
  prob_df[i,2] <- paste(problems[start[i]:end[i],][-1], collapse="")
}
prob_df[1,2]
## [1] "003020600900305001001806400008102900700000008006708200002609500800203009005010300"

Nice and simple, 50 rows with an 81 character long string of digits. This is the first puzzle in the file. It is also the puzzle where we are provided with a solution.

We now need to convert that to a matrix.

Or – we don’t need to. There are certainly techniques that would allow me to handle it as-is. But I think it is more intuitive to get it into a matrix.

So, lets do that.

mat <- matrix(as.numeric(unlist(strsplit(prob_df[1,2],""))),9,9,byrow=TRUE)  
mat
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
##  [1,]    0    0    3    0    2    0    6    0    0
##  [2,]    9    0    0    3    0    5    0    0    1
##  [3,]    0    0    1    8    0    6    4    0    0
##  [4,]    0    0    8    1    0    2    9    0    0
##  [5,]    7    0    0    0    0    0    0    0    8
##  [6,]    0    0    6    7    0    8    2    0    0
##  [7,]    0    0    2    6    0    9    5    0    0
##  [8,]    8    0    0    2    0    3    0    0    9
##  [9,]    0    0    5    0    1    0    3    0    0

Neat. Split the string in individual characters. Unlist it, and cast it to numeric. Then pass it all to matrix(), noting that we want 9 rows and 9 columns. And that the matrix should be filled by row.

Depending on where you count from, the first empty cell is 1,2. Given the values that are already filled in, what values are allowed in that cell?

First, what values are allowed based on the row? From inspection it is obvious that only the values 1, 4, 5, 6, 7, 8 and 9 are allowed. Lets write a function for that:

rowall <- function(x,mat){
  setdiff(1:9, mat[x,])
}
rowall(1,mat)
## [1] 1 4 5 7 8 9

The function takes a row-number, in this example 1, and a matrix. mat[x,] returns the values in that row. And setdiff returns the values that are in 1:9 but not in the list of values already in the row.

More or less the exact same thing for the column:

# returnerer tilladte værdier i kolonne y i en matrix mat
colall <- function(y,mat){
  setdiff(1:9, mat[,y])
}
colall(2, mat)
## [1] 1 2 3 4 5 6 7 8 9

What about the frame, the 3×3 set of numbers?

I need a subset of the matrix. For our example, cell 1,2, I need the columns 1, 2 and 3. And the rows 1, 2 and 3. Or a vector 1:3. Thankfully there is symmetry, so column 1 gives the same vector as row 1. I need a function that returns 1:3 when I pass it 1, 2 or 3. 4:6 for the values 4, 5 or 6. and 7:9 for the values 7, 8 or 9.

My good colleague Henrik has tried to solve the same problem using Python. And he suggested that the way to do it is by integer division. In this way:

framelookup <- list(a=c(1,2,3),b=c(4,5,6),c=c(7,8,9))

getint <- function(x){
  x <- (x -1)%/%3+1
  unlist(framelookup[x])
}
  • Take the value x.
  • Substract 1 and divide the result by 3.
  • Then add 1. For x = 1,2 or 3, the result is 1, for x = 4, 5 or 6, the result is 2. And for 7 through 9, we get 3.
  • We can then use that to look up the vector we need in a list, and return the unlisted number.

At first I had done it with three if-statements. This solution is a bit more elegant, and Henrik thinks that if-statements should be outlawed. So in the interest of keeping the peace in our office, I’m going with that.

Now I can write a function that returns the values allowed in a given cell in a given matrix, given the values that are already filled out in the frame that cell is in:

fraall <- function(x,y,mat){
  res <- mat[getint(x),getint(y)]
  setdiff(1:9, res)
}

The function takes the coordinates of the cell, and a matrix. Based on that, getint() returns the vectors describing the frame, and saves that subframe in res. Then I use the setdiff() function in the same way I’ve done previously.

That gives me all the constraints. What values are allowed based on row, column and frame. The intersection between these three vectors gives me the allowed values for the cell. It is now simple to write a function that returns the values allowed for a given cell in a given matrix:

allowed <- function(x,y,mat){
  res <- intersect(rowall(x,mat), colall(y,mat))
  res <- intersect(res, fraall(x,y,mat))
  return(res)
}

Actually this is all I need for solving quite a lot of sudokus. Go through all the empty cells. Figure out what values are allowed. If only one value is allowed, plot it into the cell. Repeat until there are no more empty cells where only one value is allowed. A lot of sudokus can be solved with only that method.

The ones that cannot be solved in this way can be solved by the backtracking algorithm. But that is slow, so it might be a good idea to simplify the those puzzles by filling out what can be filled out with this method first.
So let’s write a function that does exactly that.

First I’ll need a list of the empty cells. The which() function can help me.

test <- which(mat ==0, arr.ind=T)
nrow(test)
## [1] 49

Which parts of the matrix is equal to 0? and arr.ind tells that what we want returned (when which is used on an array) is the array indeces. There are 56 0’es in the puzzle.

The matrix test has this structure:

head(test)
##      row col
## [1,]   1   1
## [2,]   3   1
## [3,]   4   1
## [4,]   6   1
## [5,]   7   1
## [6,]   9   1

The first pair of values is 2,1, the next is 4,1 etc. And if I take each one of those, looks at the allowed values for those positions in the matrix, and, if there is only one value allowed, place that value in that position, I get a step closer to the solution.

This does exactly that:

for(i in 1:nrow(test)){
  all <- allowed(test[i,1],test[i,2],mat)
  if(length(all)==1){
    mat[test[i,1],test[i,2]] <- all
  }
}
nrow(which(mat ==0, arr.ind=T))
## [1] 43

And, hey presto, 6 empty cells were filled. The logic is: Find the list of allowed values based on the values in the list of empty cells. If the length is 1, set the current cell to that value.

Now I need to get a new list of empty cells:

test <- which(mat ==0, arr.ind=T)

And I could do it again:

for(i in 1:nrow(test)){
  all <- allowed(test[i,1],test[i,2],mat)
  if(length(all)==1){
    mat[test[i,1],test[i,2]] <- all
  }
}
nrow(which(mat ==0, arr.ind=T))
## [1] 29

14 previously empty cells filled!

Lets make a function that does that again and again, until there is no more improvement.

We’ll start by writing a function that does it one time:

singlevalues <- function(mat){
  test <- which(mat ==0, arr.ind=T)
  for(i in 1:nrow(test)){
    all <- allowed(test[i,1],test[i,2],mat)
  if(length(all)==1){
    mat[test[i,1],test[i,2]] <- all
  }
  }
  return(mat)
}

Then I need to run that several times. When the number of empty cells after running it is the same as the number of empty cells before, stop. No more values can be filled in.

repsinglevalues <- function(mat){
  oldlen <- 1
  newlen <- 0
  while(oldlen > newlen){
  oldlen <- length(which(mat == 0))
  mat <- singlevalues(mat)
  newlen <- length(which(mat == 0))
  if(newlen==0) break()
  }
  return(mat)
}

Take a matrix. Set oldlen to 1, and newlen to 0. As long ans oldlen is larger than newlen, do this:
Set oldlen to the number of 0’es in the matrix. Run the singlevalues function from above. Find out how many empty cells, or zeroes there are now, and set newlen to that. If newlen is equal to 0, stop everything, otherwise repeat. And finally return the changed matrix.

Lets test it:

repsinglevalues(mat)
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
##  [1,]    4    8    3    9    2    1    6    5    7
##  [2,]    9    6    7    3    4    5    8    2    1
##  [3,]    2    5    1    8    7    6    4    9    3
##  [4,]    5    4    8    1    3    2    9    7    6
##  [5,]    7    2    9    5    6    4    1    3    8
##  [6,]    1    3    6    7    9    8    2    4    5
##  [7,]    3    7    2    6    8    9    5    1    4
##  [8,]    8    1    4    2    5    3    7    6    9
##  [9,]    6    9    5    4    1    7    3    8    2

Yeah! The sudoku is solved!

But not every puzzle can be solved in that way. Now for the brutish way…

First of all, lets get a fresh copy of the original puzzle read in. And a fresh list of empyt cells.

All righty! Nu kan jeg løse en del sudokuer alene ved denne metode. Så er der resten…

mat <- matrix(as.numeric(unlist(strsplit(prob_df[1,2],""))),9,9,byrow=TRUE)  
test <- which(mat ==0, arr.ind=T)
nrow(test)
## [1] 49

Back to the 49 empty cells.

The algorithm is as follows.

  • Set i <- 1
  • Get the row and column indeces from test[i,1] and test[i,2] respectively.
  • Identify the list of allowed values for that cell.
  • Set the value of the cell to the lowest of the allowed values.
  • Increment i <- i + 1.
  • Get the row and column indeces again.
  • Get the list of allowed values for that cell. If there are no allowed values, decrement i <- i -1.
  • If there are allowed values, set the value of the cell to the lowest of the allowed values.
  • When we return to a cell that already has a value, we should not set it to the lowest allowed value. We should set it to the next lowest. That is the lowest of the allowed values, excluding the value that was already there.
  • If we get to a cell with no allowed values (excluding the one that might already be there), the value of the cell should be set to 0, and i should be decrementet.

This function does that.

solve <- function(mat){
  i <- 1
  test <- which(mat ==0, arr.ind=T)
  while(nrow(which(mat==0,arr.ind=T))>0){
    allowedvalues <- allowed(test[i,1],test[i,2],mat)
    currentvalue <- mat[test[i,1],test[i,2]]
    allowedvalues <- allowedvalues[allowedvalues>currentvalue]
    if(length(allowedvalues)==0){
      mat[test[i,1],test[i,2]] <- 0
      i <- i -1
    } else{
      mat[test[i,1],test[i,2]] <- min(allowedvalues)
      i <- i + 1
    }
  }
  mat
  }

The function takes a matrix. It sets the counter/pointer i to 1.

Then a test-matrix is generated, containing all the positions that are empty.

While the number of empty places in the matrix is larger than zero do this:

  • Find the allowed values.
  • Find the current value in the cell.
  • Remove all allowed values that are smaller than the current value. In that way, taking the minimum of the remaining allowed values, will always give us the next lowest value. And if the original value was 0, we will get the lowest.
  • If there are no allowed values, set the value of the cell to zero, and decrement the counter.
  • If there are allowed values, set the value of the cell to the lowest of the remaining allowed values, and increment the counter.
  • End by returning the now filled matrix.

Does it work?

solve(mat)
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
##  [1,]    4    8    3    9    2    1    6    5    7
##  [2,]    9    6    7    3    4    5    8    2    1
##  [3,]    2    5    1    8    7    6    4    9    3
##  [4,]    5    4    8    1    3    2    9    7    6
##  [5,]    7    2    9    5    6    4    1    3    8
##  [6,]    1    3    6    7    9    8    2    4    5
##  [7,]    3    7    2    6    8    9    5    1    4
##  [8,]    8    1    4    2    5    3    7    6    9
##  [9,]    6    9    5    4    1    7    3    8    2

It does! Qap’la!

I now have two functions that solves sudokus. solve(), that bruteforces it and will always get a solution. And
repsinglevalues() that uses logic, gets a solution sometimes, but not always.

I also have a dataframe with 50 sudokus that I need to solve.

And what was the task again? Take the three first digits in the first row. Concatenate them to a single three digit number. Do that for all 50 puzzles. Add them all up. That should be the answer.

answer <- 0
for(i in 1:50){
  mat <- matrix(as.numeric(unlist(strsplit(prob_df[i,2],""))),9,9,byrow=TRUE)  
  mat <- repsinglevalues(mat)
  mat <- solve(mat)
  answer <<- answer + as.numeric(paste(mat[1,1:3],collapse=""))
}
  • Set a variable answer to zero.
  • Pick out the sudokus one by one.
  • Run the logic-solving function on it.
  • Then run the brute-force function on the result of that.
  • Pick out the three first digits in row 1, collapse them, cast as numeric, and add to the answer-variable.

Done!

Lessons learned

  1. Errors in the logic are a bit difficult to locate. I forgot to take into account that the list of allowed values in the brute-force solution does not include the value that is already in the cell. That made the function run for eternity. Or something pretty close. Had I not made that mistake, I would finished this problem a day earlier.
  2. Technically the lesson has not been learned yet. But below there is a note ot speed. Adding just one extra value to the puzzle reduces the time it takes to bruteforce the solution significantly.
  3. Set-functions, here specifically setdiff(), are neat!

Notes on speed

OK. I needed a function for returning a vector for getting the relevant frame in the matrix. As I mentioned, my colleague Henrik thought it should be done wit integer division rather than if-statements.

This was the way I originally did it:

ifgetint <- function(x){
if(x %in% c(1:3)){
  res <- c(1:3)
} else if(x %in% c(4:6)){
  res <- c(4:6)
  } else {  
      res <- c(7:9)
  }
  return(res)
}

But what way is faster? Lets find out:

library(microbenchmark)
mbm <- microbenchmark(ifgetint, getint, times=1000)
mbm
## Unit: nanoseconds
##      expr min lq   mean median uq   max neval cld
##  ifgetint  50 55 60.824     56 58  3906  1000   a
##    getint  52 57 78.630     59 60 17763  1000   a

Actually my original way of doing it was a little bit faster.

library(ggplot2)
autoplot(mbm)
plot of chunk unnamed-chunk-23

So. I found two methods for solving sudokus. It would be interesting to compare them. I can’t do that on just any sudoku. I need to use one that can be solved with both methods. Luckily the first problem in the set can do just that.

mat <- matrix(as.numeric(unlist(strsplit(prob_df[1,2],""))),9,9,byrow=TRUE)  
solve(mat)
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
##  [1,]    4    8    3    9    2    1    6    5    7
##  [2,]    9    6    7    3    4    5    8    2    1
##  [3,]    2    5    1    8    7    6    4    9    3
##  [4,]    5    4    8    1    3    2    9    7    6
##  [5,]    7    2    9    5    6    4    1    3    8
##  [6,]    1    3    6    7    9    8    2    4    5
##  [7,]    3    7    2    6    8    9    5    1    4
##  [8,]    8    1    4    2    5    3    7    6    9
##  [9,]    6    9    5    4    1    7    3    8    2
repsinglevalues(mat)
##       [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
##  [1,]    4    8    3    9    2    1    6    5    7
##  [2,]    9    6    7    3    4    5    8    2    1
##  [3,]    2    5    1    8    7    6    4    9    3
##  [4,]    5    4    8    1    3    2    9    7    6
##  [5,]    7    2    9    5    6    4    1    3    8
##  [6,]    1    3    6    7    9    8    2    4    5
##  [7,]    3    7    2    6    8    9    5    1    4
##  [8,]    8    1    4    2    5    3    7    6    9
##  [9,]    6    9    5    4    1    7    3    8    2
mbm <- microbenchmark(solve(mat),repsinglevalues(mat), times=100)
mbm
## Unit: milliseconds
##                  expr       min        lq      mean    median        uq
##            solve(mat) 47.681675 52.437511 55.053319 53.852274 54.621338
##  repsinglevalues(mat)  6.637329  7.001639  7.631094  7.157844  7.393044
##        max neval cld
##  171.67914   100   b
##   11.03237   100  a

No surprise there. The logical method is much faster. Something like 8 times faster.

autoplot(mbm)
plot of chunk unnamed-chunk-25

How much of a difference does it make when part of the sudoku has been solved already? By inspection (ie, trial and error) I find sudoku number 11. The original puzzle has 53 empty cells. By filling out what can be filled out using logic, that is reduced to 52 empty cells.

How much faster is it to solve sudoku number 11, when an additional cell has been filled out using logic?
mat is the original sudoku. redmat is the sudoku partially solved by logic.

i <- 11
mat <- matrix(as.numeric(unlist(strsplit(prob_df[i,2],""))),9,9,byrow=TRUE)  
redmat <- repsinglevalues(mat)

mbm <- microbenchmark(solve(mat),solve(redmat), times=10)
mbm
## Unit: milliseconds
##           expr      min       lq     mean   median       uq      max neval
##     solve(mat) 767.3325 777.2788 794.4597 781.4264 791.7815 911.4314    10
##  solve(redmat) 544.3811 546.2505 574.9207 549.2904 553.6324 690.5734    10
##  cld
##    b
##   a

Wow! That really makes a difference!! Filling in just a single extra value using logic cuts about one third of the time it takes to solve the sudoku.

autoplot(mbm)
plot of chunk unnamed-chunk-27

melt() – tidy data

Just a short note to help me remember the melt()-function.

Lets create some messy data:

ID <- 1:10
T1 <- runif(10,10,20)
T2 <- runif(10,20,30)
T3 <- runif(10,30,40)
df <- data.frame(ID=ID, T1=T1, T2=T2, T3=T3)

This is heavily inspired by a practical problem a student came to us with. There is 10 different patients, at time = T1, there is a certain value measured on the patient. At time = T2 the same value is measured. And again at time = T3, where T1<T2<T3.

We would now like to plot the development of those values as a function of time (or just T1,T2 and T3).

How to do that?

Using reshape2, and making sure we have the tidyverse packages loaded:

library(tidyverse)
library(reshape2)
clean <- melt(df, id.vars=c("ID"), value.names=c("T1", "T2", "T3"))
head(clean)
##   ID variable    value
## 1  1       T1 13.93662
## 2  2       T1 10.30468
## 3  3       T1 18.14351
## 4  4       T1 17.58294
## 5  5       T1 18.13877
## 6  6       T1 10.21993

Nice, we now have tidy data.

Note that melt() also takes a variable “variable.name”, in case we have a different sort of mess.

Now it is easy to plot:

plot <- ggplot(clean, aes(x=variable, y=value, group=ID)) +
  geom_line()
plot

plot of chunk unnamed-chunk-4

Neat.

Posted in R

Euler 49

There is a sequence 1487, 4817 and 8147 that has three properties:
1. All numbers are primes
2. They are all permutations of each other
3. Each term increases by a certain number (in this case 3330)

Project Euler claims that there are two such sequences with four-digit numbers. Our task is to find the other.
The answer is the three primes with the required properties, pasted together in ascending order.

As usual I’ll load the numbers library. And the first task must be to find all four-digit primes.

library(numbers)
cand <- Primes(1000,9999)

But we will not need these prime. We already know that answer.

cand <- cand[-which(cand %in% c(1487, 4817, 8147))]

If one of these four-digit primes is part of such a sequence, there must exist at least two other primes, which are permutations of it.
Lets begin by generating that list
The library gtools provides the function permutations()

library(gtools)

listing <- function(x){
  b <- unlist(strsplit(as.character(x),""))
  c <- permutations(4,4,v=b, set=FALSE)
  d <- apply(c, 1, paste, collapse="")
  d <- as.numeric(d)
  sort(unique(d[d %in% cand])) 
}

We take a four-digit number as input, converts it to characters via as.character(), and uses strsplit() to get a character vector with for numbers.
Then permutations. The first “4” indicates that the source vector (b) has 4 elements. The second “4” that we want results that contains 4 elements. That we take “b” as the source vector – i.e. the elements from which the permuations should be made. And set=FALSE is a lgical flag. Default is TRUE, and will remove duplicates. That is not what we want, so it is set to FALSE.
Then all the permutations are collapsed to four-digit strings, converted to numeric, and duplicates are removed with unique(), returning a list of all permuations of the digits in the input, that are in the list of candidate primes.

Thats nice. But we know that there should be three numbers in the sequence. This function simply tests if there are at least three numbers in the listing:

poss <- function(x){
  length(listing(x)) >= 3
}

That was the simple part. We can now take all the four-digit primes, excluding the three we already know about, make permutations of them, and see if there are at least three permutations among them, that are prime.

Now it gets a bit more complicated. Given a set of four-digit numbers, we need to locate a subset of three, A, B and C. The differences C-B and B-A should be equal.
There are probably more elegant ways to do that, but this is the way I found.
Given a number x, get all possible permutations of that:
listing(x)
Get all the combinations of three of these numbers:
comb(listing(x),3)
Now we have a matrix, with all the possible combinations in the columns.
Find the differences between the first and the second row, and the second and the third row, for all columns:
apply(apply(combn(listing(x), 3),2,diff)
If the two differences are equal, they fullfil the third property these numbers should have. Therefor we need to find the differende between the two differences. It that is 0, we have our result:
apply(apply(combn(listing(x), 3),2,diff),2,diff)==0])
That gives a logical vector, that is TRUE whenever the difference between the three numbers is the same. We get the numbers by:
combn(listing(x), 3)[,apply(apply(combn(listing(x), 3),2,diff),2,diff)==0]
I’ll define that as a function:

corrComb <- function(x){
  combn(listing(x), 3)[,apply(apply(combn(listing(x), 3),2,diff),2,diff)==0]  
}

It would be nice to have that as a logical test as well. I’m running out of ideas for reasonable function names:

test <- function(x){
  as.logical(sum(corrComb(x)))
}

Very simple – if there is a combination fulfilling the requirements, return TRUE.

Now, lets put it all together. I’ll use the library purrr, which allows me to use these nifty pipes.

Taking the candidate primes, we keep the ones that have three or more permutations that are prime, and in our candidate list. Of those, we keep the ones that have fulfill the requirement that the pairwise differences between three of them are equal.
Those are piped to corrComb, that returns the three numbers. They are piped to sort, so we get them in ascending order, and then I paste them together.

library(purrr)
answer <- cand %>%
  keep(poss) %>%
  keep(test) %>%
  corrComb%>%
  sort %>%
  paste(.,collapse="")

Lessons learned.

  1. Those pipes are pretty neat.
  2. The apply family is also pretty cool
  3. Read the documentation. I spent some time being confused about the results from permuations() until I discovered the set-flag

Changing class on columns i a dataframe

Given a dataframe with some columns that has a “wrong” class, eg character, that should be numeric. How to make that change.

i <- as.character(1:10)
c <- rep("a",10)
c
##  [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
df <- data.frame(a=i,b=i, c=c, d=i, stringsAsFactors = FALSE)
sapply(df,class)
##           a           b           c           d 
## "character" "character" "character" "character"

The dataframe has four columns. The columns a,b and d are characters, but should be numeric.

df[,c("a","b","d")] <- sapply(df[,c("a","b","d")], as.numeric)
sapply(df,class)
##           a           b           c           d 
##   "numeric"   "numeric" "character"   "numeric"

An important thing to note here is that apply functions a bit different from the other members of the apply-family:

apply(df,2,class)
##           a           b           c           d 
## "character" "character" "character" "character"
sapply(df,class)
##           a           b           c           d 
##   "numeric"   "numeric" "character"   "numeric"

What happens is that apply coerces df to an array. And all content in an array have to be of the same class. The lowest common denominator for that is character.

if (!require('knitr')) {
  install.packages("knitr")
}
if (!require('devtools')) {
  install.packages("devtools")
}
if (!require('RWordPress')) {
  devtools::install_github(c("duncantl/XMLRPC", "duncantl/RWordPress"))
}
library(RWordPress)
library(knitr)

options(WordPressLogin=c(ChristianKnudsen="2ZKwcFS6vD*fqWMQ"),
        WordPressURL="https://christianknudsen.info/xmlrpc.php")
knit2wp('classchange.Rmd', title = 'Changing class on columns i a dataframe',publish = FALSE)
## 
## 
## processing file: classchange.Rmd
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |.......                                                          |  11%
##   ordinary text without R code
## 
## 
  |                                                                       
  |..............                                                   |  22%
## label: unnamed-chunk-83
## Quitting from lines 9-14 (classchange.Rmd)
## Error in sink(con, split = debug): datakanalstak er fuld
Posted in R

Euler 80

Problem 80 from Project Euler.

The problem tells us that if the square root of a natural number is not an integer, it is irrational.
Project Euler also claims that it is well known. I did not know it.

We are then told that the square root of 2 is 1.4142135623730950… And that the digital sum of the first 100 digits is 475.

The task is now to take the first 100 natural numbers. And find the total of the digital sums for the first 100 digits for all the irrational square roots.

Lets begin by figuring out how to handle that many digits. R does not support more than around 15 places after the decimal point.

The library Rmpfr can handle arbitrary precision:

library(Rmpfr)
## Loading required package: gmp
## 
## Attaching package: 'gmp'
## The following objects are masked from 'package:base':
## 
##     %*%, apply, crossprod, matrix, tcrossprod
## C code of R package 'Rmpfr': GMP using 64 bits per limb
## 
## Attaching package: 'Rmpfr'
## The following objects are masked from 'package:stats':
## 
##     dbinom, dnorm, dpois, pnorm
## The following objects are masked from 'package:base':
## 
##     cbind, pmax, pmin, rbind
a <- sqrt(mpfr(2,500))

The variable a now contains the square root of 2 with a precision of 500 bytes. I’m not quite sure how many decimal places that actually translates to. But testing the following code allows me to conclude with confidence that it is at least 100.

A thing to note here is, that

a <- mpfr(sqrt(2),500)

and

a <- sqrt(mpfr(2,500))

are not equal. In the first exampel sqrt(2) is evaluated before saving the value with the high precision. Start by converting the number 2 to a high precision representation, before doing math on it.

Next is writing a function that will return the digital sum of the first 100 digits of a number.

digitsum <- function(x){
  s <- 0
  for(i in 1:100){
    s <- s + floor(x)
    x <- (x - floor(x))*10
  }
  s
}

First s is initialized to 0. Then floor(x) gives us the first digit in x. We add that to s, and subtract it from x, and multiply by 10. Repeat that 100 times, and you get the sum of the first 100 digits in x.

Let us test that it works. Project Euler told us what the result for sqrt(2) is:

digitsum(a)
## 1 'mpfr' number of precision  500   bits 
## [1] 475

Nice, the correct result (not that that guarantees that I’ve done everything correctly).

Now, lets find all the irrational square roots we need to look at:

library(purrr)
t <- 1:100
s <- t %>%
  keep(~as.logical(sqrt(.x)%%1))

I need to practice this way of coding a bit more. t contains the first 100 natural numbers. I pass that to the keep()-function, and the predicate function takes the square root of each number, take the modulus 1, and convert it to a logical value. If the modulus of the square root is 0, the square root is an integer, and 0 i false. So we’re keeping all the non-integer squareroots.

Now I’ll convert all the natural numbers to the mpfr-class. The next line takes the square root. The third line calculate the digitalsum. And the final line gives us the result.

s <- mpfr(s,500)
r <- sqrt(s)
r <- digitsum(r)
sum(r)
## 1 'mpfr' number of precision  500   bits 
## [1] Censored

Lessons learned:
Rmpfr allows us to work with (more or less) arbitrary precision.
But we need to convert numbers to the relevant class before doing math on it.