Digital Natives

One can only hope that the concept “Digital Natives” will soon be laid to rest. Or at least all the ideas about what they can do.

A digital native is a person who grew up in the digital age, in contrast to digital immigrants, who acquired their familiarity with digital systems as adults.

And there are differences. Digital natives assume that everything is online. Stuff that is not online does not exist. Their first instinct is digital.

However, in the library world, and in a lot of other places, the idea has been that digital natives, because they have never experienced a world without computers, grok them. That they just know how to use them, and how to use them in a responsible and effective way.

That is, with a technical term, bovine feces. And for far too long, libraries (and others) have ignored the real needs, assuming that there was suddenly no need for instruction in IT-related issues. Because digital natives.

Being a digital native does not mean that you know how to code.

Being a digital native does not mean that you know how to google efficiently.

Being a digital native does not mean that you are magically endowed with the ability to discern fake news from facts.

I myself am a car native. I have grown up in an age where cars are ubiquitous. And I still had to take the test twice before getting my license. I was not able to drive a car safely just because I had never known a world without cars. Why do we assume that a digital native should be able to use a computer efficiently?

The next project

For many years, from 1977 to 2006, there was a regular feature in the journal of the Danish Chemical Society: “Kemiske småforsøg”, or “Small chemical experiments”. It was edited by the founder of the Danish Society for Historical Chemistry and contained a lot of interesting chemistry, some of it with a historical angle.

The Danish Society for Historical Chemistry is considering collecting these experiments and publishing them. It has been done before, but more experiments have been published since then.

We still don’t know if we will be allowed to do it. And it is a pretty daunting task, as there are several hundred experiments. But that is what I’m spending my free time on at the moment. If we get it published, it will be for sale on the website of the Danish Society for Historical Chemistry.

Project Euler 39


We’re looking at Pythagorean triplets, that is, sets of integers a, b and c where:

\(a^2 + b^2 = c^2\)

The triangle defined by a,b,c has a perimeter.

The triplet 20, 48, 52 fulfills the equation: \(20^2 + 48^2 = 52^2\). And the perimeter of the triangle is 20 + 48 + 52 = 120.

Which perimeter p, smaller than 1000, has the most solutions?

So, we have two equations:

\(a^2 + b^2 = c^2\)

\(p = a + b + c\)

We can write

\(c = p - a - b\)

And substitute that into the first equation:

\(a^2 + b^2 = (p - a - b)^2\)

Expanding the parenthesis:

\(a^2 + b^2 = p^2 - ap - bp - ap + a^2 + ab - bp + ab + b^2\)

Cancelling:

\(0 = p^2 - 2ap - 2bp + 2ab\)

Isolating b:

\(0 = p^2 - 2ap - b(2p - 2a)\)

\(b(2p - 2a) = p^2 - 2ap\)

\(b = (p^2 - 2ap)/(2p - 2a)\)

So. For a given value of p, we can run through all possible values of a and get b. If b is an integer, we have a solution that satisfies the constraints.
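
As a quick check with the example above (my addition): take \(p = 120\) and \(a = 20\). Then \(b = (120^2 - 2 \cdot 20 \cdot 120)/(2 \cdot 120 - 2 \cdot 20) = (14400 - 4800)/200 = 48\), which matches the triplet 20, 48, 52.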

The smallest value of a we need to check is 1. But what is the largest value of a for a given value of p?

We can assume that \(a \le b\); if a happens to be larger than b, we can just switch the two. And \(b < c\) follows from the Pythagorean equation. From \(a \le b < c\) it follows that \(3a < a + b + c = p\), so \(a < p/3\).

What else? If a and b are both even, \(a^2\) and \(b^2\) are also even, so \(c^2\) is even, and then c is even. Therefore p = a + b + c is also even.

If a and b are both odd, \(a^2\) and \(b^2\) are also odd, so \(c^2\) is even, and c is then even. Therefore p = a + b + c (odd + odd + even) must be even.

If exactly one of a and b is odd, exactly one of \(a^2\) and \(b^2\) is odd. Then \(c^2\) is odd, and c is then odd. Therefore p = a + b + c (odd + even + odd) must again be even.

So. I only need to check even values of p. That halves the number of values to check.

Alright, time to write some code:

current_best_number_of_solutions <- 0

for(p in seq(2,1000,by=2)){
  solutions_for_current_p <- 0
  for(a in 1:ceiling(p/3)){
    if(!((p**2-2*a*p)%%(2*p-2*a))){
      b <- (p**2-2*a*p)/(2*p-2*a)
      if(b >= a){ # only count each triplet once, with a as the smaller leg
        solutions_for_current_p <- solutions_for_current_p + 1
      }
    }
  }
  if(solutions_for_current_p > current_best_number_of_solutions){
    current_best_p <- p
    current_best_number_of_solutions <- solutions_for_current_p
  }
}

answer <- current_best_p

current_best_number_of_solutions is initialized to 0.

For every p from 2 to 1000, in steps of 2 (only checking even values of p), I set solutions_for_current_p to 0.

For every value a from 1 to p/3, rounded up to an integer: if !((p**2-2*a*p)%%(2*p-2*a)) is true, that is, if the remainder of (p**2-2*a*p)/(2*p-2*a) is 0, then b is an integer. If that b is also at least as large as a – so that each triplet is only counted once – I increment solutions_for_current_p.

After running through all possible values of a for the value of p we have reached in the for-loop:

If the number of solutions for this value of p is larger than the previous current_best_number_of_solutions, we have found a value of p with more solutions than any previous value of p we have examined. In that case, set current_best_p to the current value of p, and current_best_number_of_solutions to the number of solutions we have found for this value of p.

If not, don't change anything; reset solutions_for_current_p and check a new value of p.
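
As a sanity check (my addition, not part of the original post): the Project Euler problem statement notes that p = 120 has exactly three solutions, {20,48,52}, {24,45,51} and {30,40,50}, so the counting logic should return 3 for p = 120:

p <- 120
solutions <- 0
for(a in 1:ceiling(p/3)){
  if(!((p**2-2*a*p)%%(2*p-2*a))){
    b <- (p**2-2*a*p)/(2*p-2*a)
    if(b >= a){
      solutions <- solutions + 1
    }
  }
}
solutions
# expected result: 3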

Project Euler 4

A palindromic number is similar to a palindrome: it reads the same from left to right and from right to left.

Project Euler tells us that the largest palindrome made from the product of two 2-digit numbers is 9009. That number is made by multiplying 91 and 99.

I must now find the largest palindrome made from the product of two 3-digit numbers.

One restriction is clear: the three-digit numbers cannot end with a zero. If a factor ends in zero, the product also ends in zero, and a palindrome cannot end in zero, since it would then also have to begin with zero.

There are probably other restrictions as well.

I’ll need a function that tests if a given number is palindromic.

library(stringr)

palindromic <- function(x){
  sapply(x, function(x) (str_c(rev(unlist(str_split(as.character(x),""))), collapse="")==as.character(x)))
}

The inner function converts x to character, splits it into individual characters, unlists the result, reverses that, and concatenates it back into a string. That string is then compared to the original x, converted to a character (str_split and str_c come from the stringr package, hence the library call).
The sapply part kinda vectorises it. But it is still the slow part.
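
A quick check (my addition), using the palindrome from the problem statement and an arbitrary non-palindrome:

palindromic(c(9009, 1234))
# expected result: TRUE FALSE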

If I could pare the number of numbers down, that would be nice.

One way would be to compare the first and last digits in the number.

first_last <- function(x) { 
  x %/% 10^(floor(log10(x))) == x%%10
}

This function finds the number of digits in x, minus 1 (floor(log10(x))). I then integer-divide the number by 10 to the power of that, which gives me the first digit, and compare it with the last digit (x %% 10). If the first and the last digit are the same, it returns TRUE.
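
A quick check (my addition): 9009 starts and ends with 9, while 9110 starts with 9 and ends with 0:

first_last(c(9009, 9110))
# expected result: TRUE FALSE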

Now I am ready. Generate a vector of all three-digit numbers from 101 to 999. Expand the grid to get all combinations. Convert to a tibble, filter out all the three-digit numbers that end with 0, calculate a new column as the product of the two numbers, filter out all the results where the first and last digit are not identical, and then filter out the results that are not palindromic. Finally, pass it to max (using %$% to access the individual variables), and get the result.

library(dplyr)
library(magrittr)

res <- 101:999 %>% 
  expand.grid(.,.) %>% 
  as_tibble() %>% 
  filter(Var1 %% 10 != 0, Var2 %% 10 != 0) %>% 
  mutate(pal = Var1 * Var2) %>% 
  filter(first_last(pal)) %>% 
  filter(palindromic(pal)) %$% 
  max(pal)

There are probably faster ways of doing this…

Data leaks

When is data anonymous? That is a very good question, and one that is increasingly relevant for my work.

Our data labs at the University Library of Copenhagen (or whatever our current name is) are beginning to be a success. We see a steady increase in the number of students and researchers from the health sciences. And that triggers a recurring discussion.

Let me begin by noting that our users are very conscious of the issues around protecting sensitive information. They use encrypted hardware and secure connections to a degree I have only ever seen among people security-conscious enough to border on tinfoil-hat territory. But they are still a bit naive about anonymizing data.

I have no idea how to anonymize data. And the more I read about it, the less sure I am that it is actually possible. People smarter than me are probably able to figure something out. But I fear that this is a game rather like the DRM game.

Yes, the studios can encrypt their Blu-ray discs. But they still need to be able to show the movie on a screen, so the disc will have to be decrypted somehow; otherwise it will just show static. And the data you are working with may be stripped of all identifying information. But there still needs to be information in it. Otherwise it is just useless.

So – I cannot advise our students on how to de-identify patients in a clinical study. But I can tell them horror stories. And I do. Here are a few of them.

Netflix and IMDB

The classic story is the re-identification of Netflix users. Netflix has periodically released data on their users. Anonymized, of course: which movies a given user has watched, and how that user has rated them.

Another source of information about what movies a person has watched and rated is IMDB. And that information is not so secret. Let us assume that an unknown person has watched ten obscure movies on Netflix and given the first five a high rating and the other five a low one. And that a known person on IMDB has rated the same five obscure movies high, and the other five low. Intuition would suggest that those two persons are the same. Is that a problem?

If you live in an area where being gay is a problem, you might not have a problem with people knowing that you have rated obscure but innocent movies on IMDB. But the Netflix data, if linked to you, would reveal that you have also watched Another Gay Movie, Philadelphia and Milk. That might be a problem. I don’t think “Salo” is on Netflix. And I’m not necessarily that embarrassed to admit that I have watched it. But I would probably not want people to know that I have watched it ten times (if I had. It’s horrifying). Here’s a paper on the case.

Postcodes

A lot of demographic data is released to the public. We want people to know if living in a certain area causes cancer. And we want the underlying data out there, because there is just too much data to analyze, so if we could crowdsource that part of the process, it would be nice. So we anonymize the data, but leave in the postcodes. That might be a problem.

The Danish postcode 1301 corresponds to the street “Landgreven”. According to www.krak.dk, 17 persons have an address there. There might be a few more: the site only registers people with a phone number and leaves out people with an unlisted number. But let us assume that there are only those 17 persons. 8 of them are women. So if we have health data on medical procedures, broken down by postcode and gender, we might be able to say that one of 8 named women had an abortion. Not that there is much stigma associated with that in Denmark, or at least there shouldn’t be. But it is still something you would probably like to keep to yourself.

Twitter, Flickr and graphs

Some people like to be anonymous on Twitter. Looking at the name-calling, flame wars and the general pouring of crap over people you disagree with on Twitter, it is surprising that more people are not trying to be anonymous there. But some people have legitimate reasons to try to be anonymous: whistleblowers, human rights activists etc.

Social media are characterized by graphs. Not pie charts and such, but networks. Each person is a node, and each connection between nodes, following someone for example, is an edge. The network defined by nodes and edges is called a graph. Two researchers, Narayanan and Shmatikov, have made an interesting study, “De-anonymizing social networks”. Take a lot of persons who have accounts on both Twitter and Flickr. Anonymise the Twitter accounts. One third of those Twitter accounts can be linked to the Flickr account of the same person. In spite of the anonymisation.

How? Well, the graph describing who you follow and who follows you on Twitter will share characteristics with the graph on Flickr. And those graphs are pretty unique. Read more here.

 

Vision. Mission. Strategy. Tactics? Values?

I have a love-hate relationship with visions and missions. The ones that companies and organizations spend a lot of time and resources developing. They often take a rather formulaic form:

We exist to restore intellectual capital whilst continuing to collaboratively coordinate information.

This is actually a mission statement from a mission statement generator.

I do love the idea of missions, visions and strategies. It speaks to the engineer in me. We should define the goal we want to achieve, and then break that goal down into individual work-packages. When we have completed all of them, we have achieved our goal. The logical framework approach is a good example.

I also like values. I need consistency. Or, I don’t need it, but it is important to me. I like people and organizations to actually be consistent in their actions. Walk the talk! If you want a work environment that does not discriminate – do not discriminate. Against anyone. In any way. I do not really mind that you discriminate based on gender. Go ahead. Just be honest about it. In reality, I will of course hate you if you discriminate based on gender. But the hate will take on a deeper and more incandescent nature if you discriminate based on gender while claiming that you are all about equality.

So – I love strategies. I love visions. I love missions. I love values.

On the other hand, I hate them. Most of them are exactly like the example above – an example that is taken from a mission statement generator. No one is able to explain the difference between mission and vision, and the values all fail the negation test and tend to end up being rather tautological. Often they are self-contradictory. You can have loyalty in all situations. Or you can have honesty in all situations. But you can’t have both: sometimes being loyal means holding back on honesty a bit. How do you prioritize your values? I have never seen a description of a hierarchy.

Usually it doesn’t matter. Most values in organizations tend to be nothing more than hot air. And, as hot air, easy to dismiss when it is opportune. Try it yourself: tell your boss that her decision is wrong because it is in conflict with the defined values of the organization.

So – I love values. But not when they are meaningless.

And if they are – don’t bother defining them. They are just going to be a waste of time.

Mann-Whitney-Wilcoxon test

The Mann-Whitney U test, also known as the Mann-Whitney-Wilcoxon test, is a non-parametric test of the null hypothesis that it is equally likely that a randomly selected value from one sample will be less than or greater than a randomly selected value from a second sample.

Intuitively I think I might be able to see how this translates to “the two samples are from populations that are actually different”. I have a harder time describing that intuition.

The test is nearly as efficient as a t-test. But the advantage is that it does not require us to assume that the two samples are normally distributed.

It can also be used to test if two independent samples were selected from populations that have the same distribution.

The idea is that we have two samples. If we pick a random value from sample A and a random value from sample B, there is a likelihood, a chance, that the value from sample A is larger than the value from sample B. Is that likelihood the same as the likelihood that the value from sample A is smaller than the value from sample B?
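
Written as a formula (my notation, not from the original post): if X is a random draw from the first sample and Y a random draw from the second, the null hypothesis is \(P(X > Y) = P(Y > X)\).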

How to do that in R?

Let us look at some data. mtcars is a dataset, from the 1974 Motor Trend US magazine, with some facts about various cars. One of the numbers we have is mpg, miles per gallon, or fuel efficiency. Another is about the transmission: is there a manual or an automatic gearbox? In the column am, a “0” indicates an automatic gearbox and a “1” a manual gearbox.

We can now extract two sets of data on the fuel efficiency, based on the type of transmission:

aut <- mtcars[which(mtcars$am==0),]$mpg
man <- mtcars[which(mtcars$am==1),]$mpg

In “aut” we have the fuel efficiency of the cars with an automatic gearbox, and in “man” the same, but for the manual gearbox that is more familiar in Europe.
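
As a quick check of the two group sizes (my addition), mtcars should contain 19 cars with an automatic gearbox and 13 with a manual one:

table(mtcars$am)
# expected result: 19 cars with am == 0, 13 cars with am == 1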

How to do the Wilcoxon test?

wilcox.test(man,aut)
## Warning in wilcox.test.default(man, aut): cannot compute exact p-value with
## ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  man and aut
## W = 205, p-value = 0.001871
## alternative hypothesis: true location shift is not equal to 0

The p-value is 0.001871. At a 0.05 significance level, we can conclude that the two sets of values come from non-identical populations. Had the p-value been larger than 0.05, we would not have been able to reject the null hypothesis that the two samples come from identical populations.

There is a different way to do the test, where we don’t split the dataset in two:

wilcox.test(mpg ~ am, data=mtcars)
## Warning in wilcox.test.default(x = c(21.4, 18.7, 18.1, 14.3, 24.4, 22.8, :
## cannot compute exact p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  mpg by am
## W = 42, p-value = 0.001871
## alternative hypothesis: true location shift is not equal to 0

We tell the function that we want to test mpg (fuel efficiency) as a function of am (the type of transmission), and that the dataset we are working on is mtcars. The function then splits the dataset in two based on the values in am. Had there been more than two levels, the function would refuse, since the grouping factor must have exactly two levels.

The p-value is the same, though. The W statistic differs (42 instead of 205) because the formula interface uses the am == 0 group as the first sample, but the conclusion is identical.

The warning

We get a warning. That is because the algorithm has a problem computing an exact p-value when there are ties, identical values, in the data. There is a way around it: we can add a “jitter” to the two datasets. jitter() adds small random values to the values in the dataset. Often that solves the problem:

wilcox.test(jitter(man),jitter(aut))
## 
##  Wilcoxon rank sum test
## 
## data:  jitter(man) and jitter(aut)
## W = 205, p-value = 0.001214
## alternative hypothesis: true location shift is not equal to 0

On the other hand, we are no longer comparing the original values in the dataset. We can see the difference here:

man - jitter(man)
##  [1] -0.051431873 -0.070007257 -0.036621231 -0.047522075 -0.039480170
##  [6]  0.056659217  0.057724536 -0.003120035 -0.033923965 -0.019053550
## [11]  0.028977073  0.001790751  0.045607303

It would normally be a bad idea to add random noise to the data. On the other hand, it is not very likely that two cars have exactly the same fuel efficiency. The values have probably been “binned”, and the addition of a little random noise does not change them much. But you should always consider why the warning arises, whether it is actually a problem and not just a warning, and whether it is appropriate to solve it in this way.
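
One practical note (my addition): jitter() is random, so the p-value of the jittered test will change slightly from run to run. If you need a reproducible result, set a seed first:

set.seed(1) # any fixed seed will do; it just makes the jitter reproducible
wilcox.test(jitter(man), jitter(aut))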


Project Euler 187

Project Euler 187. Semiprimes

I had to look it up. Semiprimes are numbers that are the product of two prime numbers. And only two, although they may be equal.

There are ten of them below 30: 4, 6, 9, 10, 14, 15, 21, 22, 25 and 26.

16 is not. Its only prime factor is 2, but it occurs four times.

How many of these semiprimes are there below \(10^8\)?

That should be pretty straightforward: generate all primes below \(10^8\), make all the multiplications, and count how many unique numbers there are below n, where \(n = 10^8\).

One problem:

library(pracma) # assuming primes() comes from the pracma package; the original post does not say
n <- 10**8
numbers <- primes(n)
length(numbers)
## [1] 5761455

That is a lot of numbers to multiply.

A trick: 2 times all the primes below n/2 will give all the semiprimes (smaller than n) that have 2 as one of the prime factors.

3 times all the primes below n/3 will in the same way give all the semiprimes that have 3 as one of the prime factors.

If I can figure out how many primes there are below n/2, I get the number of semiprimes that have 2 as one of the two prime factors. The same goes for the number of primes below n/3. If I continue that up to \(\sqrt{n}\), and add them all together, I should get the total number of semiprimes below n.

One issue though. Among the primes below n/2 that I multiply by 2 is 3, giving 6. And among the primes below n/3 that I multiply by 3 is 2, also giving 6. I need to figure out how to only count 6 once.
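
The fix is to only pair each prime p with primes q that are at least as large as p. Written out (my formalisation of the argument above), the number of semiprimes below n is \(\sum_{p \le \sqrt{n}} \#\{q \text{ prime} : p \le q < n/p\}\), and that is exactly what the counting below does.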

I just generated all the primes below n. The number of primes below n/2 is:

length(numbers[numbers<n/2])
## [1] 3001134

And the number of primes below n/3 is:

length(numbers[numbers<n/3])
## [1] 2050943

I do want to multiply 3 by 3. But I need to exclude 2.

length(numbers[numbers<n/3 & numbers>=3])
## [1] 2050942

Qapla’! One less prime.

I just need to do it for all of them.

n_semi_primes <- function(x){
  counter <- 0
  for(i in numbers[numbers<=sqrt(x)]){
    counter <- counter + length(numbers[numbers<x/i & numbers>=i])
  }
  return(counter)
}

I’m writing this as a function, taking a limit x. A counter is set to 0. And for all primes i up to \(\sqrt{x}\), I add the number of primes that are at least i and less than x/i.

I can test it on the example given:

n_semi_primes(30)
## [1] 10

That was the number of semiprimes below 30. And then it is just a question of running it on \(10^8\):

answer <- n_semi_primes(10**8)

Project Euler problem 62

Euler problem 62

The cube 41063625 (\(345^3\)) can be permuted to produce two other cubes: 56623104 (\(384^3\)) and 66430125 (\(405^3\)). In fact, 41063625 is the smallest cube which has exactly three permutations of its digits which are also cube.

Find the smallest cube for which exactly five permutations of its digits are cube.

Alright. I need to find five cubes that are permutations of the same digits.

How to check if two numbers are permutations of each other?

We can generate the largest permutation of a given number. If the largest permutations of two numbers are identical, the two numbers are permutations of each other.

So I need a function that returns the largest permutation of a number. It would be nice if that function was vectorized.

max_perm <- function(t){
  require(magrittr)
  require(stringr)
  options(scipen=5)
  t %>% 
    as.character() %>% 
    str_split("") %>% 
    lapply(sort, decreasing=TRUE) %>% 
    lapply(paste0, collapse="") %>% 
    unlist() %>% 
    as.double()
}

Convert the input to character. Split at “”. That returns a list of vectors containing the individual digits of the input. lapply sorts the individual vectors in the list in decreasing order. Then lapply pastes the elements in each vector together with paste0 and “” as the separator. Then it is unlisted and returned as a numeric (str_split comes from the stringr package, hence the require call).

What is worth noting is a thing I was struggling with for far too long. R likes to write numbers in scientific notation, as in “1e+06”. I have not studied the phenomenon in detail, but options(scipen=5) solves the problem. scipen is the “penalty” used to decide when a number should be written in scientific notation. Unless I change it (trial and error, but it should be larger than the default of 0), as.character(1000000) will return “1e+06”. And the permutations of “1” “e” “+” “0” “6” are not terribly useful in this context.
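
A quick sanity check (my addition) with the three cubes from the problem statement: since they are permutations of each other, they should all give the same largest permutation.

max_perm(c(41063625, 56623104, 66430125))
# expected result: 66543210 66543210 66543210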

I’m hazarding a guess that I don’t need to handle cubes of values of more than four digits.

Beginning with a vector of all numbers from 1 to 9999, I convert it to a data frame. I transmute the first column to a column with the name x. Then I mutate a second column, cube, into existence, and calculate it as the cube of the x-value. A third column, max_cube, is mutated with the result from my max_perm function above. And that column is immediately used to group the data, so I get the data grouped by identical maximum permutations. I filter on the count of those groups, and only keep the groups that contain 5 elements. Then I ungroup, and select just the cube column.

I now have a data frame with a single column containing 10 values. They are all cubes; five of them are permutations of each other, and the other five are also permutations of each other. And now I just have to take the smallest of them.

result <- 1:9999 %>% 
  as.double() %>% 
  as.data.frame() %>% 
  transmute(., x = .) %>% 
  mutate(cube = x**3) %>% 
  mutate(max_cube = max_perm(cube)) %>% 
  group_by(max_cube) %>% 
  filter(n()==5) %>% 
  ungroup() %>% 
  select(cube) %>% 
  min()

Before I print the result, so I can enter it into Project Euler, I set options(digits=17).
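
In code (my small addition), that is simply:

options(digits=17) # print the full number instead of a rounded, scientific-notation version
result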

Done! A good exercise. And a valuable lesson in the importance of the options in R.