Starlog – 2

The course “Star Trek: Inspiring Culture and Technology” asks me to do a media analysis. Or – write about Star Trek. The latter interpretation is much more interesting 🙂

The question posed is:

“Which pilot best addresses the contemporary societal issues from when it was produced, while taking the most advantage of the television format on which it was shown? Rank the episodes you watch in numerical order, where 1 is the episode that best answers the question prompt.”

The episodes are:

  • “The Cage” – TOS (first pilot)
  • “Where No Man Has Gone Before” – TOS (second pilot)
  • “Encounter at Farpoint” – TNG
  • “Emissary” – DS9
  • “Caretaker” – VOY
  • “Broken Bow” – ENT
  • “The Vulcan Hello” – DIS

Before answering, it might be worth noting that watching Star Trek from outside the US actually makes this a bit difficult. What were the contemporary societal issues in the US in 1987? And how do they relate to “Encounter at Farpoint”? Those of us living in the rest of the world (96% of it) do have a pretty good idea about the current societal issues in the US. And as a die-hard trekkie, it is pretty easy to figure out what they were. Just look at the issues treated in Star Trek. On the other hand, those are the issues that we notice today, and they might reflect the issues that we today think are important, were important, should be important, or should have been important.

Anyway, here goes.

  1. Discovery. Women, women everywhere! An issue that is clearly perceived as important today is female underrepresentation in media and elsewhere. The gender-atypical name “Michael” for the protagonist also speaks to the societal issue of trans rights, and we finally saw a gay couple on-screen in Star Trek.
  2. TNG. Not quite out of the Cold War yet, humanity is on trial for our past transgressions. We are being held accountable for our wrongs by an omnipotent being, in a post-apocalyptic setting after a nuclear war. I would say it addresses the fear of war and the growing awareness of environmental disaster.
  3. DS9. The first black captain! Also, religion is treated quite differently from what we have previously seen.
  4. ENT. My best guess is the race issue. We are confronted with a very different culture that is, to some extent, a threat to humanity. At the same time humanity is put in its place – or rather, a superior race tries to put us in our place.
  5. TOS. “Where No Man Has Gone Before”. A black woman in command! A person of, probably, Japanese descent, presented to an audience that must have grown up learning that Japan was an existential threat to the US. Both on the bridge, in positions of relative authority.
  6. TOS – especially “The Cage”. To be honest: too cerebral. The only issue I can find is the female Number One.
  7. VOY. A female captain. Feminism takes center stage in Star Trek. But last on my list, because we have already seen strong female characters in all the previous series, at times where these issues were, or perhaps should have been, more pressing.

As to the use of the television format: I do not really see a difference between the series. All of them were first broadcast on a single channel, and then went into syndication. Even Discovery is basically broadcast on a single channel. One difference is that we do not have to tune in at a certain time, but can watch the episodes at our leisure. That might make it easier to gather new audiences. The main difference is probably that the way the stories are told, streamed or not, has changed. The long story arcs give room for more character development, compared to the more episodic storytelling of previous series. But that change began before streaming. Enterprise has long story arcs as well, as does DS9.

Starlog

Alright, for the next several weeks, I’ll be taking a break from R, data analysis, visualizations and other stuff on this blog.

This will be my StarLog.

I’ve joined a course on EdX: “Star Trek: Inspiring Culture and Technology”, and this is my homework 🙂

First job is to introduce myself, and what brought me to this course.

My name is Christian Knudsen, I am 44 years old. And I did not grow up with Star Trek. Actually I was a bit dismissive of trekkies. Such geeks! I had watched a few episodes of TNG on Danish television, and had a bit of a crush on Wesley. But never more than that. About 14 years ago, I embraced my inner trekkie. A complete season of DS9 was on sale. I don’t remember which, but I bought it. Watched it. And was hooked. I acquired the complete collection, and watched it all. I watched it in order. I watched it in order with my husband again. And currently we are watching it in StarDate order. We have been to cons. The priest referenced Star Trek at our wedding. There is a Star Trek quote inscribed in our rings.

To me, Star Trek is the promise of a bright future. The promise that humanity has a chance. That we will, somehow, overcome our problems. Like all good science fiction, it tackles current issues, and provokes thought. Living in a liberal, social-democratic country in northern Europe, it is sometimes difficult to understand exactly why a controversial subject is that controversial. The progressive themes are not necessarily viewed as that progressive in a Danish context. That also gives an interesting perspective on an American culture that can appear very alien to outsiders.

And that was the way I got here. The final ingredient was a link in a Danish Star Trek group on Facebook!

And with that, I earned my promotion to CWO!

Where to see Giant Pandas

Zoos with Giant Pandas in their exhibitions, as of mid-March 2019.

And Copenhagen Zoo, which will get their pandas in April.

Corresponding value to a max-value

One of our users needs to find the max value of a variable. He also needs to find the corresponding value in another variable.
As in: the maximum value in column A is in row 42 – what is the value in column B, row 42?

And of course we need to do it for several groups.

Let us begin by making a dataset. Three groups in id,

library(tidyverse)

id  <- 1:3               # three groups
val <- c(10, 20)         # two values per group
kor <- c("a", "b", "c")  # corresponding letters, recycled by cbind

example <- expand.grid(id, val) %>%   # all combinations of id and val
  as_tibble() %>% 
  arrange(Var1) %>% 
  cbind(kor, stringsAsFactors = FALSE) %>% 
  rename(group = Var1, value = Var2, corr = kor)

example
##   group value corr
## 1     1    10    a
## 2     1    20    b
## 3     2    10    c
## 4     2    20    a
## 5     3    10    b
## 6     3    20    c

We have six observations, divided into three groups. They all have a value, and a letter in “corr” that is the corresponding value we are interested in.

So. In group 1 we should find the maximum value 20, and the corresponding value “b”.
In group 2 the max value is still 20, but the corresponding value we are looking for is “a”.
And in group 3 the max value is yet again 20, but the corresponding value is now “c”.

How to do that?

example %>%
  group_by(group) %>% 
  mutate(max=max(value)) %>% 
  mutate(max_corr=corr[(value==max)]) %>% 
  ungroup()
## # A tibble: 6 x 5
##   group value corr    max max_corr
##   <int> <dbl> <chr> <dbl> <chr>   
## 1     1   10. a       20. b       
## 2     1   20. b       20. b       
## 3     2   10. c       20. a       
## 4     2   20. a       20. a       
## 5     3   10. b       20. c       
## 6     3   20. c       20. c

The maximum value for all groups is 20. And the corresponding value to that in the groups is b, a and c respectively.

Isn't there an easier solution using the summarise function? Probably. But our user needs to do this for a lot of variables, and their names have nothing in common.
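For a single value/corr pair, a summarise-based version is indeed shorter. A minimal sketch, rebuilding the example data from above:

```r
library(dplyr)

# The same six rows as the example above
example <- tibble(
  group = rep(1:3, each = 2),
  value = rep(c(10, 20), 3),
  corr  = c("a", "b", "c", "a", "b", "c")
)

# Collapse each group to one row: the max value and the corr entry on that row
example %>%
  group_by(group) %>%
  summarise(max = max(value), max_corr = corr[which.max(value)])
```

This gives one row per group instead of repeating the max on every row, which is often what you want when reporting.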

Digital Natives

One can only hope that the concept “Digital Natives” will soon be laid to rest. Or at least all the ideas about what they can do.

A digital native is a person who grows up in the digital age, in contrast to digital immigrants, who gained their familiarity with digital systems as adults.

And there are differences. Digital natives assume that everything is online. Stuff that is not online does not exist. Their first instinct is digital.

However, in the library world, and a lot of other places, the idea has been that digital natives, because they have never experienced a world without computers, grok them. That they just know how to use them, and how to use them in a responsible and effective way.

That is, to use a technical term, bovine feces. And for far too long, libraries (and others) have ignored the real needs, assuming that there was suddenly no need for instruction in IT-related issues. Because digital natives.

Being a digital native does not mean that you know how to code.

Being a digital native does not mean that you know how to google efficiently.

Being a digital native does not mean that you are magically endowed with the ability to discern fake news from facts.

I myself am a car native. I have grown up in an age where cars were ubiquitous. And I still had to take the test twice before getting my license. I was not able to drive a car safely just because I have never known a world without cars. Why do we assume that a digital native should be able to use a computer efficiently?

The next project

For many years, from 1977 to 2006, there was a regular feature in the journal for the Danish Chemical Society. “Kemiske småforsøg”, or “Small chemical experiments”. It was edited by the founder of the Danish Society for Historical Chemistry, and contained a lot of interesting chemistry, some of it with a historical angle.

The Danish Society for Historical Chemistry is considering collecting these experiments, and publishing them. It has been done before, but more experiments were published after that.

We still don’t know if we will be allowed to do it. And it is a pretty daunting task, as there are several hundred experiments. But that is what I’m spending my free time on at the moment. If we get it published, it will be for sale at the website of the Danish Society for Historical Chemistry.

Project Euler 39

We’re looking at Pythagorean triplets, that is, triples where a, b and c are integers, and:

a² + b² = c²

The triangle defined by a,b,c has a perimeter.

The triplet 20, 48, 52 fulfills the equation: 20² + 48² = 52². And the perimeter of the triangle is 20 + 48 + 52 = 120.

Which perimeter p, smaller than 1000, has the most solutions?

So, we have two equations:

a² + b² = c²

p = a + b + c

We can write

c = p – a – b

And substitute that into the first equation:

a² + b² = (p – a – b)²

Expanding the parenthesis:

a² + b² = p² – ap – bp – ap + a² + ab – bp + ab + b²

Cancelling:

0 = p² – 2ap – 2bp + 2ab

Isolating b:

0 = p² – 2ap – b(2p – 2a)

b(2p – 2a) = p² – 2ap

b = (p² – 2ap)/(2p – 2a)

So: for a given value of p, we can run through all possible values of a and compute b. If b is an integer, we have a solution that satisfies the constraints.
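A quick sanity check of the formula, using the 20, 48, 52 triplet from above (p = 120 and a = 20 should give b = 48):

```r
# b = (p² - 2ap)/(2p - 2a), checked against the triplet 20, 48, 52
p <- 20 + 48 + 52                  # perimeter, 120
a <- 20
b <- (p^2 - 2*a*p) / (2*p - 2*a)
b                                  # 48
```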

The smallest value of a we need to check is 1. But what is the largest value of a for a given value of p?

We can see from the Pythagorean equation that a ≤ b < c. a might be larger than b, but we can then just switch a and b. So it holds. What follows from that is that a ≤ p/3.

What else? If a and b are both even, a² and b² are also even, so c² is even, c is even, and therefore p = a + b + c is also even.

If a and b are both odd, a² and b² are both odd, so c² is even, c is even, and therefore p = a + b + c must be even.

If exactly one of a and b is odd, exactly one of a² and b² is odd, so c² is odd, c is odd, and therefore p = a + b + c must be even.

So. I only need to check even values of p. That halves the number of values to check.
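The parity argument can be spot-checked by brute force. A small sketch that enumerates all triplets with a ≤ b and sides up to 100, and confirms every perimeter is even:

```r
perims <- c()
for (a in 1:100) {
  for (b in a:100) {
    h <- sqrt(a^2 + b^2)       # candidate hypotenuse
    if (h == round(h)) {       # integer c means a Pythagorean triplet
      perims <- c(perims, a + b + h)
    }
  }
}
all(perims %% 2 == 0)          # TRUE
```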

Alright, time to write some code:

current_best_number_of_solutions <- 0

for(p in seq(2, 998, by = 2)){        # only even perimeters below 1000
  solutions_for_current_p <- 0
  for(a in seq_len(p %/% 3)){         # a can be at most p/3
    if((p^2 - 2*a*p) %% (2*p - 2*a) == 0){
      b <- (p^2 - 2*a*p) / (2*p - 2*a)
      if(b >= a){                     # count each triplet only once
        solutions_for_current_p <- solutions_for_current_p + 1
      }
    }
  }
  if(solutions_for_current_p > current_best_number_of_solutions){
    current_best_p <- p
    current_best_number_of_solutions <- solutions_for_current_p
  }
}

answer <- current_best_p

current_best_number_of_solutions is initialized to 0.

For every p from 2 to 998, in steps of 2 (only checking even values of p below 1000), I set solutions_for_current_p to 0.

For every value a from 1 to p/3 (integer division): if the remainder of (p² – 2ap) divided by (2p – 2a) is 0, then b = (p² – 2ap)/(2p – 2a) is an integer, and if that b is at least a (so each triplet is counted only once) I increment solutions_for_current_p.

After running through all possible values of a for the value of p we have reached in the for-loop:

If the number of solutions for this value of p is larger than the previous current_best_number_of_solutions, we have found a value of p with more solutions than any previous value of p we have examined. In that case, set current_best_p to the current value of p, and current_best_number_of_solutions to the number of solutions we have found for that p.

If not, don’t change anything; reset solutions_for_current_p and check a new value of p.

Project Euler 4

A palindromic number is similar to a palindrome: it reads the same left to right and right to left.

Project Euler tells us that the largest palindrome made from the product of two 2-digit numbers is 9009. That number is made by multiplying 91 and 99.

I must now find the largest palindrome, made from the product of two 3-digit numbers.

What is given is that the three-digit numbers cannot end with a zero.

There are probably other restrictions as well.

I’ll need a function that tests if a given number is palindromic.

library(stringr)

palindromic <- function(x){
  # reverse the character representation of x and compare with the original
  sapply(x, function(x) str_c(rev(unlist(str_split(as.character(x), ""))), collapse = "") == as.character(x))
}

The inner function converts x to character, splits it into individual characters, unlists the result, reverses it, and concatenates it back into a string. That is then compared to the original x, converted to a character.
The sapply part kinda vectorises it. But it is still the slow part.

If I could pare the number of numbers down, that would be nice.

One way would be to compare the first and last digits in the number.

first_last <- function(x) { 
  x %/% 10^(floor(log10(x))) == x%%10
}

This function finds the number of digits in x, minus 1. I then integer-divide the number by 10 to that power, which gives me the first digit, and compare it with the last digit (x modulo 10). If the first and the last digit are the same, it returns TRUE.
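A quick check of first_last on a few values (9009 and 121 start and end with the same digit; 9010 does not):

```r
first_last <- function(x) {
  # first digit (integer division by 10^(digits-1)) vs last digit (mod 10)
  x %/% 10^(floor(log10(x))) == x %% 10
}

first_last(c(9009, 9010, 121))  # TRUE FALSE TRUE
```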

Now I am ready. Generate a vector of all three-digit numbers from 101 to 999. Expand the grid to get all combinations. Convert to a tibble,
filter out all the three-digit numbers that end with 0, calculate a new column as the product of the two numbers, filter out all the results where the first and last digit are not identical, and then filter out the results that are not palindromic. Finally, pass it to max (using %$% to access the individual variables), and get the result.

library(dplyr)
library(magrittr)

res <- 101:999 %>% 
  expand.grid(., .) %>% 
  as_tibble() %>% 
  filter(Var1 %% 10 != 0, Var2 %% 10 != 0) %>%   # drop numbers ending in 0
  mutate(pal = Var1 * Var2) %>% 
  filter(first_last(pal)) %>% 
  filter(palindromic(pal)) %$% 
  max(pal)

There are probably faster ways of doing this…

Data leaks

When is data anonymous? That is a very good question, and one that is increasingly relevant for my work.

Our datalabs at the University Library of Copenhagen (or whatever our current name is) are beginning to be a success. We have a steady increase in the number of students and researchers from the health sciences. And that triggers a recurring discussion.

Let me begin by noting that our users are very conscious of the issues regarding protecting sensitive information. They use encrypted hardware and secure connections to a degree I have only ever seen amongst people security-conscious enough to border on the tinfoil folks. But they are still a bit naive about anonymizing data.

I have no idea how to anonymize data. And the more I read about it, the less sure I am that it is actually possible. People smarter than me are probably able to figure something out. But I fear that this is a game rather like the DRM game.

Yes, the studios can encrypt their Bluray discs. But they still need to be able to show the movie on a screen. The disc will have to be decrypted somehow. Otherwise it will just show static. And the data you are working with may be stripped of all identifying information. But there still needs to be information in it. Otherwise it is just useless.

So – I cannot advise our students on how to de-identify patients in a clinical study. But I can tell them horror stories. And I do. These are a few of them.

Netflix and IMDB

The classic story is the de-identification of Netflix users. Netflix has periodically released data on their users – anonymized, of course: which movies a given user has watched, and how that user has rated them.

Another source of information on what movies a person has watched and rated is IMDB. And that information is not so secret. Let us assume that an unknown person has watched ten obscure movies on Netflix, and given the first five a high rating and the others a low one. And that a known person on IMDB has rated the same five obscure movies high, and the other five low. Intuition would suggest that those two persons are the same. Is that a problem?

If you live in an area where being gay is a problem, you might not have a problem with people knowing that you have watched obscure but innocent movies on IMDB. But the Netflix data, if linked to you, would reveal that you also have watched Another Gay Movie, Philadelphia and Milk. That might be a problem. I don’t think “Salo” is on Netflix. And I’m not necessarily that embarrassed to admit that I have watched it. But I would probably not want people to know that I have watched it ten times (if I had. It’s horrifying). Here’s a paper on the case.

Postcodes

A lot of demographic data is released to the public. We want people to know if living in a certain area causes cancer. And we want the underlying data out there, because there is just too much data to analyze, so if we could crowdsource that part of the process, it would be nice. So we anonymize the data, but leave in the postcodes. That might be a problem.

The Danish postcode 1301 corresponds to the street “Landgreven”. According to www.krak.dk, 17 persons have an address there. There might be a few more; they only register people with a phone number, and leave out people with an unlisted number. But let us assume that there are only those 17 persons. 8 of them are women. So if we have health data on medical procedures, broken down by postcode and gender, we might be able to say that one of 8 named women had an abortion. Not that there is much stigma associated with that in Denmark, or at least there shouldn’t be. But it is still something you probably would like to keep to yourself.

Twitter, Flickr and graphs

Some people like to be anonymous on Twitter. Looking at the name-calling, flamewars and general pouring of crap over people you disagree with on Twitter, it is surprising that not more people are trying to be anonymous there. But some people have legitimate reasons to try: whistleblowers, human rights activists etc.

Social media are characterized by graphs. Not pie charts and such, but networks. Each person is a node, and each connection between nodes (following, for example) is an edge. The network defined by nodes and edges is called a graph. Two researchers, Narayanan and Shmatikov, have made an interesting study, “De-anonymizing social networks”. Take a lot of persons who have accounts on both Twitter and Flickr. Anonymise the Twitter accounts. One third of those Twitter accounts can be linked to the Flickr account of the same person. In spite of the anonymisation.

How? Well, the graph describing who you follow and who follows you on Twitter will share characteristics with the graph on Flickr. And those graphs are pretty unique. Read more here.