Ethics in AI – two short notes

We are introducing AI in our search systems at the library. Nothing new, if you ask me – we have had it for years, we just called it machine learning.

That leads colleagues to advocate that we look into the ethics of AI. Again – that should not be anything new. And frankly, I am not quite sure it is all that relevant in a library setting – at least not for the way we are actually going to use AI.

Anyway, two short notes – so I have them somewhere and won't forget them:

  1. The ML algorithms work on the statistics we feed them, not on the reality we would like to have. Note two gaps here: the difference between the statistics we feed the algorithms and reality, and the difference between the reality we actually live in and the reality we would like to live in.
  2. The algorithms reflect (through the filter between reality and statistics mentioned above) the reality of yesterday and, with some probability, the reality of today.

When are you actually an expert in Excel?

I’m not going to proclaim my normative answer to that question.

I will, however, note that a lot of people claiming to be superusers are not actually that super. As in: I reject your claim to superuser status in Excel if you sum a column of 1s in order to count how many rows your data encompasses.

This site tries to provide a classification. It is made by Aaron Blood. He is a “Microsoft Most Valuable Professional”. I have worked with Excel on and off for the last 25 years. I have only just moved from “Intermediate” to “Advanced”.

Personally I think this scale is not granular enough. The step between “Intermediate” and “Advanced” is rather large. But I think the scale is useful, because it illustrates what Excel can actually do. The average “superuser” does not even know that macros are an option. And that makes it entirely reasonable for them to claim that they are superusers. They know everything they believe there is to know. They think that what they know is the entirety of what Excel can do. They have no idea how much more there is to know.

What do we need to be able to do – and how do we get there?

Librarians are supposed to be data analysts. That is the trend, at least, and it is probably a good idea too. There are natural limits to how good we will get at it. There are certainly limits to how good I will get at it.

So what is it we need to be able to do?

We need to be able to get through these five steps:

  1. Harvesting data.
  2. Manipulating the data until it is technically correct
  3. Manipulating the data until it is consistent
  4. Analysing the data
  5. Presenting the data

In step 1 we need to be able to collect data. From websites, from databases, and from the real world.
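
A minimal sketch of the website part, assuming the rvest package and a page that actually has a table on it (the URL below is just a placeholder):

library(rvest)
page   <- read_html("https://example.com")   # placeholder URL - point it at a real page with a table
tables <- html_table(page)                   # every HTML table on the page, as a list of data frames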

Step 2 is about the fact that the data we get in is often not technically correct. To a large extent it is about being able to make sure that the data we have collected is stored in the right data types. If the number “123” sits in a spreadsheet as a text field, we cannot multiply it by 10. If it is stored as a number, we can. And if sex is coded as M/K in parts of the material, and as M/F or male/female in other parts, that needs to be fixed too.
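
A minimal sketch in R of both problems, with made-up values:

x <- "123"              # the number is stored as text
as.numeric(x) * 10      # convert to the right data type first, then multiply: 1230

sex <- c("M", "K", "F", "female", "male")    # inconsistent codings of the same thing
dplyr::case_when(
  sex %in% c("M", "male")        ~ "male",
  sex %in% c("K", "F", "female") ~ "female"
)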

Step 3 can be described as: the data must be correct and consistent. We need to be able to find, and handle, missing data. We need to be able to find and handle special values – “infinity” is one example. There can also be outliers in the data. And there can be data that is inconsistent. In the social sciences it could be a questionnaire where the respondent has answered that he is 40 years old and has been married for 45 years. That does not add up, and we need to be able to catch it.
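
A minimal sketch, again with made-up data, of catching missing values, special values and that kind of inconsistency:

survey <- data.frame(age = c(40, 25, NA), years_married = c(45, 2, 10))

sum(is.na(survey))                                 # how many missing values?
sapply(survey, function(x) any(is.infinite(x)))    # any special values such as Inf?
subset(survey, years_married > age)                # married longer than they have been alive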

In step 4 we need to be able to analyse data. We do not need to be statistical experts. We do not need to master every statistical method under the sun. But we need to know a minimum. A simple correlation is something we should be able to handle.
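
A simple correlation is, for example, a one-liner in R – here on the built-in mtcars data:

cor(mtcars$wt, mtcars$mpg)        # correlation between car weight and fuel economy
cor.test(mtcars$wt, mtcars$mpg)   # the same, with a confidence interval and p-value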

And finally we need to be able to present data. Making a simple graph of some data should not be an insurmountable challenge for us. We do not need to throw ourselves into network analyses or clever ways of plotting five-dimensional data. But a pie chart must not be foreign to us.
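
A minimal sketch, sticking with mtcars:

library(ggplot2)
ggplot(mtcars, aes(x = factor(cyl))) +
  geom_bar()                 # how many cars have 4, 6 and 8 cylinders

pie(table(mtcars$cyl))       # or, if it really must be a pie chart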

How do we get there?

I don't know. But there are three basic competencies that need to be in place:

  1. A minimum of understanding of programming. Data types, data structures, the logical structure of programs.
  2. A minimum of understanding of data manipulation. We are talking about knowing a bit more than the absolute basics of Excel.
  3. Serious Google skills, and a can-do attitude. You have to be prepared to find solutions on your own, and you must not be afraid to try.

So before we begin: get yourself through an Excel course. Chew your way through one of the perfectly good, and free, online programming courses. I would recommend Python, which is apparently big among data analysts. And stand in front of a mirror and repeat these three mantras until you believe them:

“If I have a problem, I can find a solution online.”
“It doesn't matter if I try something and it doesn't work – I can always see if I can find another solution instead.”
“I cannot break the computer by trying.”

They do not hold 100% true in reality. But you will get very far if you try.

Proof of identity vs identification

Most of this blog is in English. I work in an international field. But here my English comes up a bit short – at least in writing.

We have a problem with CPR numbers – the Danish civil registration numbers. And it is caused by a fundamental misunderstanding.

The CPR number is the unique identification of a person in Denmark. The ten digits attached to me identify me. When we talk about a person with the CPR number 123456-7890, there is one, and only one, person we can be talking about.

We also know that this is a person born in the non-existent month 34, and that the person was born with a biology that makes her a “she” (the last digit is even).

So when I walk into a shop and enter into a loan agreement which states that the borrower's CPR number is 123456-7890 – then it is unambiguously identified that it is this very fictional person who is liable for the money.

Does that mean that I, the person standing in the shop, was that person?

No. I could have stolen a health insurance card. For that matter, I could have produced one myself – it is not that hard. In the most egregious examples, the borrower has not shown anything at all, but simply stated the CPR number. If I have not presented proof that I am the person identified by that CPR number, there is no guarantee that I actually am that person.

The problem arises because people cannot tell the difference between exactly that: identification and proof of identity. The CPR number is treated as a secret code – if you can state it, you must be that person. But it is not the least bit secret. It is printed on your health insurance card. It is hidden in the barcode. You happily give the first six digits to anyone. And the last digit can, for 99.4% of the population, be narrowed down to one of five.

You think it is secret, and that if you can state it, there can be no doubt that it really is you. Providers of payday loans pretend it is secret, and that they have got hold of the right person if they are told the number. I am quite sure they are well aware of the problem, but simply do not care. Everyone pretends that the CPR number can be used to prove who you are. And because we pretend that it can, we get problems with identity theft, when people can take out consumer loans simply by stating a CPR number. And people panic when their CPR number is made public.

We ought to publish all CPR numbers. Then it would hopefully become clear to everyone that the CPR number does not prove anything at all.

Working from home – it’s not all bad

As a follow-up to the previous post.

Yes, a sub-optimal meeting culture does not automatically become optimal because it goes online.

But! This corona/COVID-19/Chinese flu crisis amplifies things. The racial problems in the US become more pronounced. The problems with the British class system are suddenly visible even from across the North Sea. The challenges that workers in precarious positions in a global economy face are magnified.

And the problematic elements of a meeting culture become glaring enough that something might actually be done about them!

Working from home

Denmark has been in lockdown – or something like it – for close to two months now. On March 11th our prime minister announced that all public employees were to be sent home as soon as possible. A lot of other institutions, companies and functions were similarly closed down. Further restrictions were introduced in the following weeks.

My husband was also told to work from home. He had his first full day at home on the 12th. I had to show up at work on the 12th, and was sent home early because I had an online meeting in the afternoon.

How is this working? The first week and a half it was great! Finally I had the time to get on top of things, uninterrupted by meetings, colleagues asking if I wanted to drink coffee, or if I could help change the toner in the printer.

The next two weeks – a timeline that excludes the week and a half of mandatory vacation at home around Easter – were not that productive. The stuff that could be done without talking to colleagues was running out.

And after that – now we have found our way. Online meetings are working; debating how we should design a general, introductory workshop on data visualization is as frustrating as it would have been otherwise. And the amount of irrelevant meetings is back to normal. Do not misunderstand me – meeting with colleagues without an agenda is nice. And necessary, especially since we are in almost complete isolation at home. On the other hand: not every meeting is actually necessary. I started the lockdown joking that we would now discover which meetings could have been emails. Now I think we are beginning to discover which emails could have been meetings. If we can't meet, at least we can have a meeting.

In other words – the bad habits of the physical office are entering the online office.

Managers and colleagues have no problem booking online meetings at the same time as you are in another meeting. With little to no understanding that you actually are in another meeting. As I wrote to my boss: I can't wait for us to get physically back to the library, so I can tell you that I cannot attend that meeting, since I am out of town at another meeting.

Usually it is possible to get people to understand that if your meeting at location A ends at 12.00, you are not able to attend another meeting at location B starting at 12.00. Or at least they understand it if there are 30 minutes of transportation between the two locations. Not so in the virtual world. Leading to a day (two weeks ago as of this writing) with the first meeting beginning at 9.00 and ending at 9.30. The next meeting from 9.30 to 11.00. The next again from 11.00 to 12.00. And the next from 12.00 to 13.00. The fifth meeting started at 13.00 and finished at 14.00. Not that we were actually finished, but the sixth meeting was at 14.00. Happily it was a short one, lasting only 15 minutes. So I had 45 minutes for a bio break before the final meeting started at 15.00.

 

Errorbars on barcharts

Errorbars

An errorbar is a graphical indication of the standard deviation of a measurement – or of a confidence interval.

The mean of a measurement is a single number. We want to illustrate how confident we are in that number.

How do we plot that?

I am going to use three libraries:

library(ggplot2)
library(dplyr)
library(tibble)

ggplot2 for plotting, dplyr to make manipulating data easier. And tibble to support a single line later.

Then, we need some data.

data <- data.frame(
  name=letters[1:3],
  mean=sample(seq(4,15),3),
  sd=c(1,0.2,3)
)

I now have a dataframe with three variables, name, mean and standard deviation (sd).

How do we plot that?

Simple:

data %>% 
  ggplot() +
  geom_bar( aes(x=name, y=mean), stat="identity", fill="skyblue") +
  geom_errorbar(aes(x=name, ymin=mean-sd, ymax = mean+sd), width=0.1)

(Plot: a bar chart of mean per name, with error bars running from mean - sd to mean + sd.)

So. What on earth was that?

The pipe-operator %>% takes whatever is on the left-hand-side, and inserts it as the first variable in whatever is on the right-hand-side.

What is on the right-hand-side is the ggplot function. That function would normally have data = something as its first argument. Here, that is the data we constructed earlier.

To that initial plot, which is completely empty, we add a geom_bar. That plots the bars. It takes an x-value, name, and a y-value, mean. And we tell the function that, rather than counting the number of observations for each x-value (the default behavior of geom_bar), it should use the y-values provided. We also want a nice sky blue color for the bars.

To that bar-chart, we now add errorbars. geom_errorbar needs to know the x- and y-values of the bars, in order to place the errorbars correctly. It also needs to know where to place the upper and lower end of each errorbar. So we supply the information that ymin, the lower end, should be the mean minus the standard deviation, and the upper end, ymax, the mean plus the standard deviation. Finally we decide how wide those errorbars should be, by writing “width=0.1”. We do not actually have to, but the default value results in an ugly plot.

And there you go, a barchart with errorbars!

Next step

That was all very nice. However! We do not usually have a nice dataframe with means and standard deviations calculated directly. More often, we have a dataframe like this:

mtcars %>% 
  remove_rownames() %>% 
  select(cyl, disp) %>% 
  head()
##   cyl disp
## 1   6  160
## 2   6  160
## 3   4  108
## 4   6  258
## 5   8  360
## 6   6  225

I’ll get back to what the code actually means later.

Here we have 32 observations (only 6 shown above), of the variables “cyl” and “disp”. I would now like to make a barplot of the mean value of disp for each of the three different values or groups in cyl (4,6 and 8). And add the errorbars.

You could scroll through all the data, sort them by cyl, manually count the number of observations in each group, add the disp, divide etc etc etc.

But there is a simpler way:

mtcars %>% 
  remove_rownames() %>% 
  select(cyl, disp) %>% 
  group_by(cyl) %>% 
  summarise(mean=mean(disp), sd=sd(disp))
## # A tibble: 3 x 3
##     cyl  mean    sd
##   <dbl> <dbl> <dbl>
## 1     4  105.  26.9
## 2     6  183.  41.6
## 3     8  353.  67.8

mtcars is a built-in dataset (cars in the US in 1974). I send it, using the pipe-operator, to the function remove_rownames, which does exactly that. We don't need the row names, and they will just confuse us. That result is then sent to the function select, which selects the two columns/variables cyl and disp, and discards the rest. Next, we group the data according to the value of cyl. There are three different values: 4, 6 and 8. And then we use the summarise function to calculate the mean and the standard deviation of disp for each of the three groups.

Now we should be ready to plot. We just send the result above to the plot function from before:

mtcars %>% 
  remove_rownames() %>% 
  select(cyl, disp) %>% 
  group_by(cyl) %>% 
  summarise(mean=mean(disp), sd=sd(disp)) %>% 
  ggplot() +
    geom_bar( aes(x=cyl, y=mean), stat="identity", fill="skyblue") +
    geom_errorbar(aes(x=cyl, ymin=mean-sd, ymax = mean+sd), width=0.1)

(Plot: a bar chart of the mean disp for each value of cyl, with error bars.)
All we need to remember is to change “name” in the original to “cyl”. All done!

But wait! There is more!!

Those errorbars can be shown in more than one way.

Let us start by saving our means and sds in a dataframe:

data <- mtcars %>% 
  remove_rownames() %>% 
  select(cyl, disp) %>% 
  group_by(cyl) %>% 
  summarise(mean=mean(disp), sd=sd(disp))

geom_crossbar results in this:

data %>% 
ggplot() +
  geom_bar( aes(x=cyl, y=mean), stat="identity", fill="skyblue") +
  geom_crossbar( aes(x=cyl, y=mean, ymin=mean-sd, ymax=mean+sd))

(Plot: the geom_crossbar version – boxes spanning mean ± sd drawn on top of the bars.)

I think it is ugly. But whatever floats your boat.

Then there is just a vertical bar, geom_linerange. I think it makes it a bit more difficult to compare the errorbars. On the other hand, it results in a plot that is a bit cleaner:

data %>% ggplot() +
  geom_bar( aes(x=cyl, y=mean), stat="identity", fill="skyblue") +
  geom_linerange( aes(x=cyl, ymin=mean-sd, ymax=mean+sd))

(Plot: the geom_linerange version – plain vertical lines from mean - sd to mean + sd.)

And here is geom_pointrange. The mean is shown as a point. This probably works best without the bars.

data %>% ggplot() +
  geom_bar( aes(x=cyl, y=mean), stat="identity", fill="skyblue", alpha=0.5) +
  geom_pointrange( aes(x=cyl, y=mean, ymin=mean-sd, ymax=mean+sd))

(Plot: the geom_pointrange version – a point at the mean with a vertical line spanning mean ± sd, on top of semi-transparent bars.)
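
As noted, geom_pointrange arguably works best without the bars – a minimal sketch, simply dropping the geom_bar layer from the plot above:

data %>% ggplot() +
  geom_pointrange( aes(x=cyl, y=mean, ymin=mean-sd, ymax=mean+sd))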

Project Euler 5 – Smallest multiple

What is the smallest, positive, number that can be divided by all numbers from 1 to 20 without any remainder?

We are given that 2520 is the smallest number that can be divided by all numbers from 1 to 10.

One number that can definitely be divided by all numbers from 1:20 is:

factorial(20)
## [1] 2.432902e+18

But given that

factorial(10)
## [1] 3628800

is rather larger than 2520, it is definitely not the answer.

The answer must be a multiple of all the primes smaller than 20. A number that is divisible by 15 will be divisible by 3 and 5.

The library “numbers” has a lot of useful functions. Primes(20) returns all primes smaller than 20, and prod() returns the product of all those primes.

library(numbers)
prod(Primes(20))
## [1] 9699690

Could that be the answer?

What we need is the modulo operator: 9699690 modulo 2 – what is the remainder? We know that all the remainders, when dividing by 1 to 20, must be 0.

prod(Primes(20)) %% 2
## [1] 0

And our large product is divisible by 2 without a remainder.

Thankfully the operator is vectorized, so we can do all the divisions in one go:

9699690 %% 1:20
##  [1]  0  0  0  2  0  0  0  2  3  0  0  6  0  0  0 10  0 12  0 10

Nope.

9699690 %% 4
## [1] 2

Leaves a remainder.

(2*9699690) %% 4
## [1] 0

Now I just need to find the number to multiply 9699690 by, in order for all the divisions to leave a remainder of 0.
That is, change i in this code until the answer is TRUE.

i <- 2
all((i*9699690) %% 1:20 == 0)
## [1] FALSE

Starting with 1*9699690, I test whether all the remainders of the divisions by the numbers from 1 to 20 are zero.
As long as they are not, I increase i by 1, save i*9699690 as the answer, and test again.
When the test is TRUE, that is, when all the remainders are 0, the while-loop quits, and I have the answer.

i <- 1
while(!all((i*9699690) %% 1:20 == 0)){
 i <- i + 1
 answer <- i*9699690
}
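
For what it is worth, the same answer can be reached without any searching: the number we are after is simply the least common multiple of 1 to 20. A minimal sketch in base R, building the LCM from Euclid's gcd:

gcd <- function(a, b) if (b == 0) a else gcd(b, a %% b)   # Euclid's algorithm
lcm <- function(a, b) a / gcd(a, b) * b
Reduce(lcm, 1:20)                                          # fold lcm over 1..20
## [1] 232792560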

Weird behavior of is.numeric()

An interesting observation. Please note that I have absolutely no idea why this happens (at least not at the time of first writing this).

In our datalab, ingeniously named “Datalab” (the one at KUB-North – because, contrary to the labs at the libraries for the social sciences and the humanities, we are not allowed to have a name), we were visited by a student.

She wanted to make a dose-response plot. Something about the concentration of something, giving some clotting of some blood. Or something…

Anyway, she wanted to fit a 4-parameter logistic model. That's nice – I had never heard about that before, but that is the adventure of running a datalab, and what makes it fun.

Of course there is a package for it, dr4pl. After an introduction to the wonderful world of dplyr, we set out to actually fit the model. This is a minimal working example of what happened:

library(tidyverse)
library(dr4pl)

data <- tibble(dose= 1:10, response = 2:11)

dr4pl(data, dose, response)
## Error in dr4pl.default(dose = dose, response = response, init.parm = init.parm, : Both doses and responses should be numeric.

WTF? “Both doses and responses should be numeric”? But they are!

is.numeric(data$dose)
## [1] TRUE
is.numeric(data$response)
## [1] TRUE

The error is thrown by these lines in the source:

if(!is.numeric(dose)||!is.numeric(response)) {
    stop("Both doses and responses should be numeric.")
  }

Let's try:

if(!is.numeric(data$dose)||!is.numeric(data$response)) {
    stop("Both doses and responses should be numeric.")
  } else {
    print("Where did the problem go?")
  }
## [1] "Where did the problem go?"

No idea. Did it disappear? No:

dr4pl(data, dose, response)
## Error in dr4pl.default(dose = dose, response = response, init.parm = init.parm, : Both doses and responses should be numeric.

Looking at the data might give us an idea of the source:

str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  2 variables:
##  $ dose    : int  1 2 3 4 5 6 7 8 9 10
##  $ response: int  2 3 4 5 6 7 8 9 10 11

Both dose and response are integers. Might that be the problem?

data <- tibble(dose= (1:10)*1.1, response = (2:11)*1.1)
dr4pl(data, dose, response)
## Error in dr4pl.default(dose = dose, response = response, init.parm = init.parm, : Both doses and responses should be numeric.

Nope. Both dose and response are now definitely numeric:

str(data)
## Classes 'tbl_df', 'tbl' and 'data.frame':    10 obs. of  2 variables:
##  $ dose    : num  1.1 2.2 3.3 4.4 5.5 6.6 7.7 8.8 9.9 11
##  $ response: num  2.2 3.3 4.4 5.5 6.6 7.7 8.8 9.9 11 12.1

But the problem persists.

Might this be the reason?

head(data,2)
## # A tibble: 2 x 2
##    dose response
##   <dbl>    <dbl>
## 1   1.1      2.2
## 2   2.2      3.3

It is a tibble. And the variables are reported to be doubles.

But according to the documentation:

“numeric is identical to double (and real)”

That should not be a problem then.

However. In desperation, I tried this:

c <- data %>% 
  as.data.frame() %>% 
  dr4pl(dose, response)

And it works!

Why? Absolutely no idea!

Or do I?

The problem is that subsetting a tibble returns a new tibble. Well, doesn't subsetting a dataframe return a dataframe as well?

It does. Unless you subset out a single variable.

With a dataframe, df[, "var"] returns a vector containing the values of var in df.

If, however, df is a tibble, df[, "var"] returns a tibble with just one variable. And is.numeric() on a one-column tibble is FALSE, even when the column itself is numeric. That, presumably, is what trips up dr4pl somewhere in its internals – and why converting to a plain data.frame makes the problem go away.
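
A minimal sketch of that difference (my best guess at the culprit):

library(tibble)

df <- data.frame(x = 1:3)
tb <- tibble(x = 1:3)

is.numeric(df[, "x"])   # TRUE  - a single column of a data.frame drops to a plain vector
is.numeric(tb[, "x"])   # FALSE - the same subsetting of a tibble returns a one-column tibble
is.numeric(tb$x)        # TRUE  - $ still returns a plain vector, even for a tibble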