Those civil war monuments in the US

I once saw a video suggesting that the US should choose Canada as their next president. They would solve the race-problem. As soon as they figured out why there still was a race-problem. It’s one of those things that are pretty hard to understand for europeans. And apparently for canadians as well.

Anyway, there is a problem, and lately it’s been manifesting in protests against monuments celebrating confederate war-“heroes”. Thats another thing that is difficult to understand from across the atlantic. Those monuments must certainly have been standing there for a very long time. After all, the war ended in 1865. They may be celebrating the losers, but seriously, what’s the problem? Well… Actually there is a problem. Most of the monuments were not erected to commemorate fallen soldiers just after the war. They were erected to show those n******, who was still in charge when they started to organize for rights. And again when those rights were granted (to some degree). There’s a lot more information in this report. And suddenly even europeans begin to understand what all the fuss is about. But still. Seriously? Get over it, you lost, be nice, and treat people like, you know, Jesus said you should.

Okay – that was the introduction. And now for the data. Somewhere in there, there is an interesting datavisualization. The data is here:

https://data.world/datadanlarson/confederatemonument/workspace/file?filename=CivilWarMamorials.csv

What do I want to do with it? I want an animated GIF. Each frame in that gif should represent a year, and show a map of the US, with all the monuments erected up to and including that year. Preferably it should be noticable what monuments where erected that year. A graph showing the development of the number of monuments would be nice as well.

Okay, lets get coding. The data comes from data.world, and they have their own R-package:

  devtools::install_github("datadotworld/data.world-r", build_vignettes = TRUE, force = TRUE)

In order to get to the data, You will need a token. There is excellent help to get at data.world.

data.world::set_config(data.world::save_config(auth_token = "redacted"))

It is a horrible token. Anyway, lets plod on. Load their library, and get the data

library(data.world)
dataset_key <- "https://data.world/datadanlarson/confederatemonument"
tables_qry <- data.world::qry_sql("SELECT * FROM Tables")
tables_df <- data.world::query(tables_qry, dataset = dataset_key)
sample_qry <- data.world::qry_sql(sprintf("SELECT * FROM `%s`", tables_df$tableName[[1]]))
sample_df <- data.world::query(sample_qry, dataset = dataset_key)

This is a simple copy-paste from the examples at data.world. The net result is, that we have all the data in the dataframe sample_df.

We are interested in the year a monument was erected. Not all years are known. And this is of course a serious flaw in the following visualization. Patterns may not be what they appear, when approx. half the data is missing.

Anyway, lets get rid of the rows with missing data:

newdata <- sample_df[complete.cases(sample_df),]

complete.cases returns a logical vector. True if the row is free from NAs, eg. is complete, False if there is an NA in it. newdata is now a dataframe with only the complete cases.

We need coordinates. Google is our friend, however, I’m not going to run that geocoding again. Google allows 2500 calls to their API every day from a given IP-number. It is easy to run out of calls.

library(ggmap)
newerdata <- data.frame(city=character(), state = character(), year = numeric(), status = character(), lat = numeric(), lon = numeric(), stringsAsFactors = FALSE)
for (row in 1:nrow(newdata)){
result <- geocode(paste(newdata[row,2]$city, newdata[row,1]$state, sep=", "), output="latlon", source="google")

    newerdata[nrow(newerdata) + 1,] = c(newdata[row,2]$city,newdata[row,1]$state, as.numeric(newdata[row,5]$year), newdata[row,6]$civilwarstatus, result$lat, result$lon)
  
}
save(newerdata, file="newestdata.rda")

What happens? We call ggmap, which provides the function geocode. A new dataframe, newerdata, is defined. For each row in newdata, I call geogode on the city and state, separated with a “,”, and saves the result in “result”. And then I add the rows I want to the newerdata dataframe. There is probably a better way to do that. But it works. Here there is another thing that should probably be handled. Sometimes Google is not able to determine exactly what location we give it. There may be more than one place in a US state with the same name. I’m just taking the first result google gives me. A bit sloppy, I know. make some plots.

Finally I save the data. Now we should be ready to We are going to need some libraries for that:

library(grid)
library(gganimate)
library(animation)
library(ggplot2)
require(cowplot)

I’ll just make absolutely sure that the data is in the form I need it to be:

load("newestdata.rda")
newestdata <- newestdata[complete.cases(newestdata),]
newestdata <- na.omit(newestdata)
newestdata$year <- as.numeric(newestdata$year)
newestdata$lat <- as.numeric(newestdata$lat)
newestdata$lon <- as.numeric(newestdata$lon)

The coordinates and the year should be numeric. Any rows with missing values should be gone forever.

I get a map to plot on:

us <- get_map("USA", zoom = 4)

And, lets make the first plot:

g <-ggmap(us) +
  geom_point(aes(x=lon, y=lat), data=newestdata, color="red")
g

Theres a lot of red there. I would like to make an animation. The standard way to do it here, or at least the way I usually do it, would be to define a function, that plots what I want to plot as a function of the year. Like this:

singleyear <- function(ye){
  g <-ggmap(us) +
  geom_point(aes(x=lon, y=lat), data=newestdata[which(newestdata$year<(ye)),], color="red")
 g
}

Now, when I make this function call, I should get a map with all the monuments erected before 1890:

singleyear(1890)

I’m missing the monuments erected in 1890. That is because I would like those to be a different color. Lets add to the function, and call it again:

 

 

 

singleyear <- function(ye){
  g <-ggmap(us) +
  geom_point(aes(x=lon, y=lat), data=newestdata[which(newestdata$year<ye),], color="red")+
  geom_point(aes(x=lon, y=lat), data=newestdata[which(newestdata$year==ye),], color="blue")
 g
}
singleyear(1890)

Now, when I plot the map for a given year, all monuments from that year will appear as blue. And for the next year they’ll turn red.

 

 

Lets back up a bit here.

What I’m doing is this: I take the map I retrieved from Google, and plots it with the ggmap function. Then I add points with the geom_point function. I tell the function that the data is “newestdata”, and that the points should be placed at position x,y where x is the longitude, and y the latitude. The color should be red. I am however also telling that the data should be the part of newestdata, where year is smaller than the ye-value I provide to the function. In the next line, I add the same point, just for the part of the data where year is equal to the ye-value i provide to the function. And that the color should be blue.

What I am going to do later, is to call this function for all years from 1860 to 2017, and show those plots one after another. We will get at small movie, where new monuments turn up as blue spots. And then turn red in the next frame.

Everything becomes very red. One way to do something about this, would be to make the dots transparent. Let them fade out. When the monument is just erected, let the dot be blue. The year after, let it be a transparent red dot, the year after that, make it more transparent. Areas with a lot of monuments will still be pretty red, but the new monuments will be more visible. We’ll get at sort of heatmap, where the very red areas are places with at lot of monuments, and the not so red areas have fewer.

I’ll need to have a maximum and a minimum for the transparency. Defined as af function of the year I am plotting, and the year a monument was erected. I’m gonna add it to the dataframe before I plot it. This is the line:

maxp <- 0.8
minp <- 0.3
newestdata$alpha <- (maxp - minp)/(ye-1861)*(as.numeric(newestdata$year)-ye)+maxp

A new column, alpha, is added to the dataframe, and the difference between the year I am plotting, and the year of the row is normalised to the range 0.3 to 0.8. The point begins as blue, turns red with a transparency of 0.8, and will fade to 0.3.

I’m not sure the range is perfect. I’m gonna go with it for now.

Lets add it to the function:

maxp <- 0.8
minp <- 0.3
singleyear <- function(ye){
  newestdata$alpha <- (maxp - minp)/(ye-1861)*(as.numeric(newestdata$year)-ye)+maxp
  g <-ggmap(us) +
    geom_point(aes(x=lon, y=lat), data=newestdata[which(newestdata$year<ye),], alpha=newestdata[which(newestdata$year<ye),]$alpha, color="red")+ 
  geom_point(aes(x=lon, y=lat), data=newestdata[which(newestdata$year==ye),], alpha=1, color="blue")
 g
}
singleyear(1890)

 

What I also added, was this: “alpha=newestdata[which(newestdata$year<(aar)),]$alpha”. Alpha is the transparency. So each point is plottet with the transparency I calculated.

What next? Lets adjust the plot a bit:

maxp <- 0.8
minp <- 0.3
singleyear <- function(ye){
  newestdata$alpha <- (maxp - minp)/(ye-1861)*(as.numeric(newestdata$year)-ye)+maxp
  g <-ggmap(us) +
    geom_point(aes(x=lon, y=lat), data=newestdata[which(newestdata$year<ye),], alpha=newestdata[which(newestdata$year<ye),]$alpha, color="red")+ 
    geom_point(aes(x=lon, y=lat), data=newestdata[which(newestdata$year==ye),], alpha=1, color="blue") +
    theme(plot.title=element_text(hjust=0)) +
    theme(axis.ticks = element_blank(),
        axis.text = element_blank(),
        panel.border = element_blank()) +
    labs(title="Confederate memorials", subtitle=ye, x=" ", y="")
 g
}
singleyear(1890)

 

Labels are left aligned (hjust=0), tickmarks and text is removed (all the element_blank parts). And a title “Confederate memorials” is added, with a subtitle indicating the year we have reached in the plot.

Nice. What more? The interesting part, at least I think it is interesting, is the fact that the number of these memorials increase at certain times. That can be seen on the map. But a line-plot would probably make it more visible. It would also be nice to have it side-by-side with the map. How to do that?

To begin with, I’ll need a set of data with the cumulative count of monuments:

linedata <- as.data.frame(table(newestdata$year), stringsAsFactors = FALSE)

colnames(linedata) <- c("year", "cumsum")
linedata$year <- as.numeric(linedata$year)
linedata$cumsum <- cumsum(linedata$cumsum)

I make a new dataframe, linedata. The content is table(newestdata$year). The table function summarizes the data in newestdata. It gives me a table with all the years, and the number those years occur. Eg. that there are 7 occurences of the year 1870. That corresponds to 7 monuments erected in 1870. I save that as a dataframe. Then I change the names of the colums, and make sure that the years are saved as numeric. And then I call cumsum. That is a function that calculates the cumulative sum. Ie in 1861 2 monuments where erected. None where erected before that. The cumulative sum of all monuments until 1861 was 2. In 1862 3 monuments were erected. The cumulative sum for 1862 is 5. The sum of the 3 monuments erected that year, and all the monuments erected before that.

That in itself is an interesting plot:

ye = 2017
h <- ggplot(linedata[which(linedata$year<=ye),]) +
  geom_line(aes(x=year, y=cumsum))
h

 

Something happens in 1910 and again in 1950. Or somewhere close to those years. At least that is where the graph changes shape.

Lets adjust it a bit:

ye=2017
h <- ggplot(linedata[which(linedata$year<=ye),]) +
  geom_line(aes(x=year, y=cumsum))+
  xlim(1860,2017) +
  ylim(0,850) +
  ylab("Number of confederate memorials") +
  xlab("")
h

Nothing fancy, just freezing the axes, adding a label for the y-axis, and removing it from the x-axis.

Now I can add it to the original plot. I loaded the library cowplot earlier. That gives us the function plot_grid.

 

i <- plot_grid(g,h,align='h')
i

I have two plots, g and h, and combine them in a newplot, horizontally (thats the h in align), to a new plot i. And then I plot i.

 

 

Lets add that to the function:

singleyear <- function(ye){
  newestdata$alpha <- (maxp - minp)/(ye-1861)*(as.numeric(newestdata$year)-ye)+maxp
  g <-ggmap(us) +
    geom_point(aes(x=lon, y=lat), data=newestdata[which(newestdata$year<ye),], alpha=newestdata[which(newestdata$year<ye),]$alpha, color="red")+ 
    geom_point(aes(x=lon, y=lat), data=newestdata[which(newestdata$year==ye),], alpha=1, color="blue") +
    theme(plot.title=element_text(hjust=0)) +
    theme(axis.ticks = element_blank(),
        axis.text = element_blank(),
        panel.border = element_blank()) +
    labs(title="Confederate memorials", subtitle=ye, x=" ", y="")
  h <- ggplot(linedata[which(linedata$year<=ye),]) +
    geom_line(aes(x=year, y=cumsum))+
    xlim(1860,2017) +
    ylim(0,850) +
    ylab("Number of confederate memorials") +
    xlab("")
  i <- plot_grid(g,h,align='h')
  i
}
singleyear(1890)

Now I’m getting there! Almost ready for the animation. I would like some annotation on the h-plot. A couple of arrows. And I would like them to show up in the animation. As in, from year 1910 there should be text at a certain place. But not before. It’s not that difficult.

If I add this:

if(ye>1908){

h <- h + annotate(“text”, x =1980, y = 200, label=“NAACP established”) + geom_segment(aes(x=1950, y = 200, xend=1909, yend = 200), size=1, arrow=arrow(length=unit(0.5, “cm”))) }

to the function, every plot, for a year after 1908, will have an annotation on the h-plot, at position 1980,200, with the text “NAACP established”. The next line, geom_segment, will add an arrow, with a size defined by the size and length parameters, beginning at 1950,200 and ending at 1909,200.

It took quite a bit of time to fiddle with those positions!

That is not the only annotation I want. So the complete function is as follows:

singleyear <- function(ye){
  newestdata$alpha <- (maxp - minp)/(ye-1861)*(as.numeric(newestdata$year)-ye)+maxp
  g <-ggmap(us) +
    geom_point(aes(x=lon, y=lat), data=newestdata[which(newestdata$year<ye),], alpha=newestdata[which(newestdata$year<ye),]$alpha, color="red")+ 
    geom_point(aes(x=lon, y=lat), data=newestdata[which(newestdata$year==ye),], alpha=1, color="blue") +
    theme(plot.title=element_text(hjust=0)) +
    theme(axis.ticks = element_blank(),
        axis.text = element_blank(),
        panel.border = element_blank()) +
    labs(title="Confederate memorials", subtitle=ye, x=" ", y="")
  h <- ggplot(linedata[which(linedata$year<=ye),]) +
    geom_line(aes(x=year, y=cumsum))+
    xlim(1860,2017) +
    ylim(0,850) +
    ylab("Number of confederate memorials") +
    xlab("")
  
  if(ye>1908){
    h <- h + annotate("text", x =1980, y = 200, label="NAACP established") +
       geom_segment(aes(x=1950, y = 200, xend=1909, yend = 200), size=1, arrow=arrow(length=unit(0.5, "cm")))
  }

  if(ye>1910){
    h <- h + annotate("text", x = 1870, y = 500, label="Coincidence?") +
      geom_segment(aes(x=1870, y=490, xend=1905, yend=240), size=1, arrow=arrow(length=unit(0.5, "cm")))
  }

  if(ye>1955){
    h <- h + annotate("text", x =1960, y = 400, label="Beginning of civil rights movement") +
      geom_segment(aes(x=1950, y = 450, xend=1955, yend = 640), size=1, arrow=arrow(length=unit(0.5, "cm")))
  }

  if(ye>1960){
    h <- h + annotate("text", x= 1866, y=515, label="Another") +
      geom_segment(aes(x=1882, y=500, xend=1955, yend=680), size=1, arrow=arrow(length=unit(0.5, "cm")))
  }

  i <- plot_grid(g,h,align='h')
  print(i)
}
singleyear(2017)
## Warning: Removed 1 rows containing missing values (geom_point).

OK. It looks like hell. But! In just a moment, it will look. Well, not exactly perfect, but at least much better.

The way to animate is this: Define a function that gives you the frames you want, as a function of an iterable. That’s already done. Call this:

saveGIF({
  for (ye in 1860:2017){
   singleyear(ye)
  }
}
, interval=0.3, ani.width=1280, ani.height=720)

Enjoy. Oh, and note, that it will crash if you try to animate it a notebook.

Der findes ikke problemer

Kun udfordringer.

Det har jeg i hvert fald ladet mig fortælle. Og det handler jo om at man skal italesætte tingene på den rigtige måde. Problemer er negativt ladede. Måske endda så meget, at de ikke kan løses. Så det ord må vi ikke bruge. I stedet skal vi bruge ordet udfordring, der ikke er negativt ladet. Ordet udfordring, i stedet for udfordring, skal lissom signalere, at det er noget vi løser. Vi skal bare tage os sammen, så viser det sig at udfordringen slet ikke er en udfordring, men i stedet en udfordring.

Jeg tror bare der er en udfordring med at sige udfordring i stedet for udfordring. Udfordringen består i at vi gør sproget upræcist. Vi skaber en udfordring, når vi mener at skulle bruge et andet ord end udfordring, fordi udfordring er negativt. Du skal ikke komme med udfordringer, du skal komme med løsninger. Ja, selvfølgelig, men hvis jeg havde løsningen på den her udfordring, så ville jeg ikke have behov for at komme med den til dig. Så var det jo ikke en udfordring længere. Det kunne være at der så var en udfordring ved at gennemføre løsningen på udfordringen. Men det ville jo være en anden udfordring.

Udfordringen ved ikke at måtte kalde udfordringer for udfordringer er dobbelt.

Dels er der udfordringen ved at skulle kalde udfordringer for udfordringer i stedet for udfordringer. Det mudrer sproget. Hvis du har læst hertil, vil du vide hvad jeg mener. Det er faktisk udfordrende at skrive udfordring i stedet for udfordring hver gang, og det gør sikkert også at man er ret udfordret når man skal læse teksten. Det spænder ben for udfordringsløsningen hvis man ikke må kalde en udfordring for en udfordring. Det fjerner fokus på, at når vi løser denne udfordring, så er det faktisk fordi det er en udfordring, som er negativ, og derfor skal løses. Vi forfladiger udfordringsstillingen, og dermed bliver det også udfordrende at gennemskue hvori udfordringen egentlig består. Er det overhovedet en udfordring? Er vi sikre på at det ikke bare er en udfordring i stedet? Udfordringer er også noget man kan vælge at tage imod. Det ligger i ordet. Men ikke alle udfordringer er så u-udfordringsatiske, at man kan ignorere dem. Nogen udfordringer er så udfordrende, at man bliver nødt til at gøre noget ved dem.

Den anden udfordring er, at en udfordring jo ikke holder op med at være udfordrende, blot fordi vi kalder det for en udfordring i stedet for en udfordring. Det er lidt ligesom oprindeligt neutrale betegnelser for folk med høj koncentration af melanin i huden. De er ikke længere neutrale (altså betegnelserne), de er blevet negativt ladede. Det er de blevet på grund af racisme. Så nu bruger vi et andet ord, der er neutralt. Men racismen er ikke forsvundet, så om lidt er det nye, neutrale, ord lige så negativt ladet som det gamle, oprindeligt neutrale, men nu negative ord. Udfordring er et pænt nyt ord, som vi kan bruge i stedet for det grimme gamle ord, udfordring. Men fordi virkeligheden nu engang er sådan at udfordringer jo sjældent rent faktisk er positive – hvis de var det, var de jo ikke udfordringer, så virker det kun en kort tid. Inden vi ser os om, er udfordringer noget lige så negativt ladet som udfordringer oprindeligt var. Og så begynder vi at tale om muligheder i stedet for udfordringer.

Har du fået konstateret kræft? Det er ikke en udfordring. Det er nu en mulighed. Den skal du bare gribe. Og så bliver sproget endnu mere absurd. Når man ikke må sige udfordring, men skal sige mulighed i stedet for mulighed, så går det helt galt. Muligheder lyder pænt. Pænere end muligheder. Men fordi vi kalder det for en mulighed, er det jo ikke holdt op med at være en mulighed. Muligheden er stadig negativ. Nu er vi blot begyndt at bruge et ord der i endnu højere grad implicerer en valgmulighed. Så når jeg står med en mulighed, det kunne være at der er et vandrør der er sprunget, så skal jeg forsøge at se mulighedsløsninger i den mulighed jeg har fået fordi gulvet sejler. Min underbo har også fået en mulighed, idet der drypper vand ned fra hans loft. Der må han jo bare se det positive i at han nu har fået mulighed for at løse en mulighed.

Der er også den mulighed, at folk der får at vide at de skal sige mulighed i stedet for mulighed, muligvis begynder at føle sig lidt til grin. De betragter det som muligt at muligheder faktisk er muligstiske, og at de ikke bliver mindre negative af at vi siger mulighed i stedet for mulighed. Jeg er ret sikker på at min underbo vil kigge mærkeligt på mig når jeg fortæller ham at han har en mulighed når hans gulvtæppe er blevet vådt fordi jeg har en mulighed med en oversvømmelse i køkkenet. Muligvis er der en mulighed for at folk der bruger ordet mulighed i stedet for bare at kalde tingene ved deres rette navn, bliver lidt til grin. Eller at de ikke længere bliver taget seriøst. Jeg er ikke helt overbevist om at HR-chefen vil betragte det som en mulighed, hvis hans medarbejdere betragter ham med skepsis hver gang han siger mulighed.

Muligvis skal vi i virkeligheden forsøge at skære udfordringerne ind til benet, og erkende at det ikke kan afvises, at der faktisk er tale om et problem. Hive fat i værktøjskassen med redskaber til problemløsning. Og i stedet for bare at italesætte tingene på en måde der er i overensstemmelse med den sidste managementmode, rent faktisk løse problemerne.

 

Fake News. And the importance of numbers

Or rather, the importance of all the numbers.

This is not a political blog. On the other hand, I am a firm believer in facts. There is an objective truth out there. And the world can, more or less, be explained by numbers. And when facts and numbers are weaponized in political struggles, well, it’s hard to avoid wandering in to dangerous territory.

And with that disclaimer out of the way. Recently I saw a graphic making the rounds on Facebook. A simple table listing median household income in the US by race. The numbers are said to come from the US Census Bureau.

The idea is to document that “white privilege” does not exist. “See all the people that are not white, who makes more money than whites”.

Could this be true? As a european, my expectation would be, that white, caucasian, americans are in general terms better off than, for lack of a better word, non-whites, in the US (see, this is dangerous territory, what words do I dare to use?).

One of the great things about US federal institutions is that they generally provides quite free access to their data. I did not have the patience to learn how to navigate their website. Someone at Wikipedia did. And have made this nifty table.

I’m not going to copy all the data. There’s a LOT. But there is three tables. Median household income by race, by ancestry, and by native american tribe. Let us take a look at the first:

Rank Race Median household income (2015 USD)
1 Asian-American 91,440
2 White 59,698
3 Native Hawaiian and other pacific islander 55,607
4 Some other race 42,461
5 American Indian and Alaska Native 38,530
6 Black or African American 36,544

The numbers does not quite match. But “white” at 59,698 USD compares OK with the 60,256 USD in the graphic. The graphic claims that the numbers are from 2014. The Wikipedia numbers are from 2015. Small differences should not throw us off course.

That is more in sync with my expectations. But are the numbers in the table cooked?

Nope. They are actually correct. The next table on Wikipedia lists median household incomes by ancestry. Indian American in the graphic: 101,561 USD. Indian American on Wikipedia: 101,591 USD. Same with Taiwanese. And the others. There is a small detail. The numbers by ancestry appears to come from the 2014 data from the US Census Bureau, and not the 2015. Again, this is a detail. The main point of the graphic is not that Indian Americans make 41,335 USD more than whites, but rather that they make more money. And that Taiwanese Americans do, and that Filipino Americans do and that… well you get the point.

All right, so what is wrong with that table? The thing that is wrong, is that it cherrypicks the data. Let us take a look at the table of median household income by ancestry. Just the first six rows:

 

Rank Ancestry Income
1 Indian American 101,591
2 Taiwanese American 85,566
3 Filipino American 82,389
4 Australian American 81,452
5 Israeli American 79,736
6 European American 77,440

I will NOT be dragged into the battles about what constitutes a race, how white you should be to be considered white or the whole topic of trans-racialism.
But in the context of the original table, “European American” is rather white. And when we get to “Danish American” (median income 68,558 USD) we are, statistically speaking, talking about people who are very white.

Conclusion: It is not enough to check that the numbers are correct. You also need to check that you have all the numbers.

Stuff to keep in mind

Cynical lessons for lower middle management in times of change.

  • There are no one else but you, that takes care of you.
  • Do not expect any kind of support. It might come, but don’t count on it.

More lessons to come as I finally learn them.

7/11 in Denmark – a map

Let’s be honest. I am easily distracted. While I was thinking about how to plot networks of coauthorships in Acta Chemica Scandinavica, I tinkered with getting data on my twitter-following.

Thats easy enough, but I thought it would be cool to map them. While googling that (I know. There are automated ways to do that, there are scripts I can just copy. Its not difficult. I just want to do it myself.) I stumbled across some neat heatmaps visualizing the distance to fastfood outlets in the US. That looked like fun.

So now, I’m hammering away at that at the moment….

What to map? I need coordinates. For some reason I thought of 7/11. Probably something with those US fastfood stores. Anyway, http://www.7-eleven.dk/find-butik/ has a map. And coordinates on the page!

Save, download. Delete everything that does not contain coordinates. Keep the parts with “value=[coordinates]”. And save the file. And then, some R. Set the working directory, readlines, use stringfunctions to extract the coordinates:

rm(list=ls())
setwd("~/Desktop/711")
koo <- readLines("711_coordinates.html")
koo_list <- c()
for(i in 1:length(koo)){
  res <- strsplit(koo[i], 'value=\"')
  res <- strsplit(res[[1]][2],'\">')
  res <- res[[1]][1]
  koo_list <- c(koo_list, res)
}

I have a nagging suspision that there are repeated coordinates. Lets get rid of those, and take a look:

koo <- unique(koo_list)
head(koo)
## [1] "55.68096,12.58000"   "55.681899,12.583777" "55.68216,12.57454"  
## [4] "55.67787,12.58015"   "55.68225,12.57044"   "55.68284,12.57098"

I need to get that into a dataframe:

koo_df <- data.frame(lat=character(), lng=character(), stringsAsFactors=FALSE)

for(i in 1:length(koo)){
  koordinat <- strsplit(koo[i],',')
  new_row <- c(as.numeric(koordinat[[1]][1]),as.numeric(koordinat[[1]][2]))
  koo_df <- rbind(koo_df, new_row, stringsAsFactors=FALSE)

}
colnames(koo_df) <- c("lat", "lng")
tail(koo_df)
##          lat      lng
## 181 57.05891  9.92778
## 182 57.15559  9.73841
## 183 57.45580 10.04221
## 184 57.45665  9.98596
## 185 57.44108 10.53995
## 186       NA       NA

Oh, and I should get rid of any missing values:

row.has.na <- apply(koo_df, 1, function(x){any(is.na(x))})
koo_df <- koo_df[!row.has.na,]

I need some libraries:

library("maps")
library("mapdata")
library('sp')
library('maptools')
## Checking rgeos availability: FALSE
##      Note: when rgeos is not available, polygon geometry     computations in maptools depend on gpclib,
##      which has a restricted licence. It is disabled by default;
##      to enable gpclib, type gpclibPermit()
library('spatstat')
## Loading required package: nlme
## Loading required package: rpart
## 
## spatstat 1.46-1       (nickname: 'Spoiler Alert') 
## For an introduction to spatstat, type 'beginner'
## 
## Note: spatstat version 1.46-1 is out of date by more than 11 weeks; a newer version should be available.

Maps helps drawing maps. Mapdata provides maps for the world. sp gives methods for handling spatial data. Lines and polygons eg. I’m going to need that. Maptools is another bunch of tols for handling maps. Spatstat is a set of tools for handling spatial statistical data. I only really need one of them (I think). But it is the most important.

Lets plot something:

kort <- map('worldHires', 'Denmark', col="white", fill=TRUE, bg=NA, 
    xlim=c(8,17), ylim=c(53,58), resolution=0)

Map takes a lot of different arguments. The first two: worldHires tells it that it should look to the worldHires map from mapdata. And that we should look specifically to the part called Denmark. col allows me to choose the color of the map – I’m going with black’n’white here. Fill – should the areas be filled with the color (yes, they should, for reasons I’ll get back to.) bg is the background color, and I can limit the part of the map I want to draw. Maps have coordinates as degrees as their natural units. Longitude 8 to 17, latitude 53 to 58 does nicely. The Faroe Islands are in the map (but for some reason not Greenland), I’m not going to bother with them. And Bornholm is going to be cut out later. Well, Bornholm is going to get cut now:

kort <- map('worldHires', 'Denmark', col="green", fill=TRUE, bg="blue", 
    xlim=c(8,13), ylim=c(54,58), resolution=0)

Resolution – there are two settings, 0 and 1. 0 gives the highest resolution.

There are some problems here. Smaller islands have disappered. The resolutions is not fantastic. But all that can be easily solved later. The hard part is the following.

Lets get back to the original map, and take a look:

kort <- map('worldHires', 'Denmark', col="white", fill=TRUE, bg=NA, 
    xlim=c(8,13), ylim=c(54,58), resolution=0)

str(kort)
## List of 4
##  $ x    : num [1:11164] 8.66 8.68 8.69 8.7 8.72 ...
##  $ y    : num [1:11164] 54.9 54.9 54.9 54.9 54.9 ...
##  $ range: num [1:4] 8.09 12.62 54.56 57.75
##  $ names: chr [1:10] "Denmark" "Denmark:Mors" "Denmark:Mon" "Denmark:Samso" ...
##  - attr(*, "class")= chr "map"

The kort-variable (kort because that is danish for map) is a list of four lists. x and y that defines the shapes. How that is actually done given that Denmark consist of 1? islands, I have no idea. There is a range. And there are 10 named areas in the object. All of them Denmark, but some of them islands like Mors, Møn etc.

Lets also plot the location of 7/11 stores in Denmark:

map(kort)
points(koo_df$lng, koo_df$lat)

We’ll get back to that.

What I want to plot is a distancemap. Different colors based on how far a given position.

The library spatstat has a methods/function that handles that. It’s called distmap(). It takes an object of class “ppp” – that can be made with the function ppp(), also from spatstat.

ppp() takes two vectors for x and y coordinates, a window, defining the, well, window, of the map. And that is of the class “owin”.

Therefore, I should begin by making the point pattern dataset::

A <- ppp(koo_df$lng,koo_df$lat,window=owin(xrange=c(7,13),yrange=c(53,58)))
plot(A)

It looks a bit skewed. But otherwise ok. Distmap() takes the ppp object. So its rather simple to calculate and plot:

z <- distmap(A)

plot(z)

Yeah. Very psychedelic. Looking at it, it appears that the scale is in degrees. The yellow colour should be “2.5”. That more or less matches 2½ degree, if you remember that the y-scale is 5 degrees high.

One can almost see something that looks like the outline of Denmark. It gets easier it you know that it is there.

What would be nice, would be to define a window matching Denmark. Make a cut-out of Denmark in the distancemap.

That can be done. Spatstat has a function as.owin.SpatialPolygons() that can take polygons, defined elsewhere, and convert them to a windows, that can be used in the ppp()-function. It takes an object of the type spatialpolygons. Where do we get that? maptools has a function that can convert a map to spatialpolygons: map2SpatialPolygons(). That function takes a map. And a list of the names saved in that map. We already have those. kort, and kort$names. So, first map2SpatialPolygons:

kort.poly <- map2SpatialPolygons(kort, IDs = kort$names)

Then as.owin.SpatialPolygons:

dk.owin <- as.owin.SpatialPolygons(kort.poly)

And now I have a nice cutout shapes like Denmark! I can even plot it!!

plot(dk.owin)

Neato! Now I just need to use the dk.owin window when defining the ppp object, rather than the rectangular window I used earlier:

A <- ppp(koo_df$lng,koo_df$lat,window=dk.owin)
## Warning in ppp(koo_df$lng, koo_df$lat, window = dk.owin): 20 points were
## rejected as lying outside the specified window

Run the distmap function:

dist <- distmap(A)

and plot it:

plot(dist)

Done! I do get a warning. 20 points are outside the window. Not that big a surprise, the map is not very precise. Whole islands are missing! But looking at the original map, and what I get when I plot the window, something is happening. And maybe I should take a closer look at what projection that is being used. After all, Denmark lies on a sphere. And the scale is annoying. I want it in kilometers, not degrees.

Acta network part 1

Lets get to work on the network. I don’t have any citation lists, so what I’m after here, is coauthorship. I’m going to draw on some unpublished work I did on Zika-virus research.

I can begin by disregarding all papers with only one author. I want to keep the publication year – as I want to animate the network.

As always, we’ll begin by reading the data:

data <- readRDS(file="d:\\acta\\consistentdata.rda")
head(data)

Continue reading

Animated Acta Chem. Scand.

Okay. I would like to make some networks. I would like to animate them. And I’m quietly harvesting all the pdf’s in order to OCR them, and see what I can learn from that.

Before animating networks, it might be a good idea to animate something simpler.

We went on a visit to my sister in law, and on the rather long trip (by train), I had time to read up on network-graphs in R. Some of the last pages were about animation.

so, without further ado, let me introduce this new library:

library(animation)
## Warning: package 'animation' was built under R version 3.2.5

I’ll get back to that, lets begin by getting at the data:

Continue reading

Average number of authors in Acta Chem. Scand.

We saw the rapid decline of german as a scientific language in the last installment. But what else can we learn? Given that I do not have access to the full text papers.

Well, what about the number of authors on a given paper?

I have a hypothesis. In the beginning of time, chemistry was a lonely science, where individual scientists worked and published alone.

As the years went by, chemistry became a more collaborative science, with more scientists working together, and also publishing together.

So – the average number of authors on a paper will rise, as a function of time.

There is only one way to see if I’m right. Continue reading

Prisindex

Vi tager lige en pause fra Acta Chemica Scandinavica, og så endda på dansk!

Alt dette er i øvrigt blot en note til mig selv, til næste gang jeg får behov for at sammenligne beløb over tid. Størstedelen af teksten er i virkeligheden blot underholdning.

Hvis jeg havde 1000 kr i 2009, hvor mange kroner svarer det så til i dag? Continue reading