Ropencorporate: A Package to Access Opencorporate API
A couple of month ago, I went to the London open data mission meetup in Shoreditch. It is a meetup organized by the company Open corporates, a company which gather officials documents on company creation and dissolution and make these documents available through an API.
The problem I had, back there, was that everybody was working with python. And even if I am ok to work with python, my productivity is not the same. So to make myself “at home”, I have created the R package Ropencorporate
to query the API.
Initialisation
First thing first, let’s set up the session, query the data and do some cleaning.
# download and install the package library(devtools)
devtools::install_github("YvesCr/Ropencorporate")
# Attach the libraries library(Ropencorporate, quietly = T) library(data.table)
library(stringr) library(DT) library(leaflet) library(dygraphs)
library(lubridate, quietly = T, warn.conflicts = F)
library(networkD3, quietly = T, warn.conflicts = F)
We choose the term “edf”, which is short for “Electricité de France”, a company I appreciate and would not want to see having issue due to corruption and bad decisions.
term <- "edf"
To query the database, we use the function get.companies
. Without credential, we are limited to the 20 first pages of result. For people with a token, an option token
is implemented.
res.oc <- Ropencorporate::get.companies(term, nb.page = 20)
oc.dt <- res.oc$oc.dt prev.dt <- res.oc$prev.dt
The result we get is a bit messy: The same company could appear under multiple names due to multiple format used. This is a complex task which require human assistance. Nevertheless, we could use a couple of transformation to clean the name.
For a complete cleaning, a good option is to use open refine to do this work. But it is not the point here, so we just use the fingerprint method to simplify the name. For that, we use the fingerprint.func
function plus a bit of work to keep a non formatted version of the name.
## What I want: the most frequent occurence of the name
# First, we use the fingerprint method to normalise the name
oc.dt[, name2 := unlist(lapply(name, fingerprint.func))]
# Taking the fingerprint reduce the number of company names by 10 %
# Now, replace name2 by the top name appearing in term of number, then by the first appearing.
oc.dt[, freq.name := .N, by = "name"]
oc.dt[, freq.name2 := max(freq.name), by = "name2"] # for a subset, select the top name
oc.dt[, Name := name[max(which(freq.name2 == freq.name))], by = "name2"]
#delete unused rows:
oc.dt[, freq.name := NULL] ; oc.dt[, freq.name2 := NULL]; oc.dt[, name2 := NULL]
We tighten the date variables.
oc.dt[, Creation.date := as.Date(created.at)]
oc.dt[, Update.date := as.Date(updated.at)] oc.dt[, Retrieved.date := as.Date(retrieved.at)]
oc.dt[, Dissolution.date := as.Date(dissolution.date)]
Basic overview
Now, let’s have a look at the result of the query: Number of different companies which have the term, places where the term is the more common, etc..
Occurences of the name in the query
Which are the entries with the term edf appearing the most?
num.oc <- dim(oc.dt)[1]
datatable(oc.dt[, list(Frequency = .N, Percentage = sprintf("%.1f%%", .N / num.oc * 100)), by = "Name"][order(Frequency, decreasing = T)], options = list(pageLength = 10))
Occurences of the juridictions
Which are the juridictions with the most entries?
datatable(oc.dt[, list(Frequency = .N, Percentage = sprintf("%.1f%%", .N / num.oc * 100)), by = "jurisdiction.code"][order(Frequency, decreasing = T)], options = list(pageLength = 10))
Most recent apparition
Which are the more recent companies?
datatable(oc.dt[, list(Frequency = .N, Percentage = sprintf("%.1f%%", .N / num.oc * 100)), by = c("Creation.date", "Name")][order(Creation.date, decreasing = T)], options = list(pageLength = 10))
oc.dt[, Creation.date.m := as.Date(format(Creation.date, "01 %m %Y"), "%d %m %Y")]
oc.dt[, Dissolution.date.m := as.Date(format(Dissolution.date, "01 %m %Y"), "%d %m %Y")]
oc.dt.date0 <- oc.dt[, .N, by = c("Name", "Creation.date.m")][, list(Creation.date = .N), by = c("Creation.date.m")]
Dissolution dates are not of good quality: We keep only the ones inside the boundaries between the first creation and today.
oc.dt.date1 <- oc.dt[!is.na(Dissolution.date.m) & Dissolution.date.m < Sys.Date() & Dissolution.date.m >= min(oc.dt.date0[, Creation.date.m]), .N, by = c("Name", "Dissolution.date.m")][, list(Dissolution.date = .N), by = c("Dissolution.date.m")]
oc.dt.date <- merge(oc.dt.date0, oc.dt.date1, by.x = "Creation.date.m"
, by.y = "Dissolution.date.m", all = T)
oc.dt.date[is.na(Creation.date), Creation.date := 0]
oc.dt.date[is.na(Dissolution.date), Dissolution.date := 0]
dygraph(oc.dt.date, main = "Creation & dissolution of companies")
There is a lot of creation of companies, but only a few dissolution. Some companies are bought and not dissolved.
Current status
Which are the most frequent status?
datatable(oc.dt[, list(Frequency = .N, Percentage = sprintf("%.1f%%", .N / num.oc * 100)), by = "current.status"][order(Frequency, decreasing = T)], options = list(pageLength = 10))
Inactive
Number of inactive companies?
datatable(oc.dt[, list(Frequency = .N, Percentage = sprintf("%.1f%%", .N / num.oc * 100)), by = "inactive"][order(Frequency, decreasing = T)], options = list(pageLength = 10))
Branch status
datatable(oc.dt[, list(Frequency = .N, Percentage = sprintf("%.1f%%", .N / num.oc * 100)), by = c("branch.status")][order(Frequency, decreasing = T)], options = list(pageLength = 10))
Company type
datatable(oc.dt[, list(Frequency = .N, Percentage = sprintf("%.1f%%", .N / num.oc * 100)), by = "company.type"][order(Frequency, decreasing = T)], options = list(pageLength = 10))
Create an interactive map
To create a map, we use the dataset world.states
include in the package. This dataset contain the coordinates of all the jurisdiction available in the opencorporate API.
# For the popup version, we create a label.
display <- oc.dt[, list(frequency = .N, display = paste(.N, " links here:\n <a href='", paste0(opencorporates.url, "'>", Name, collapse = "</a>\n"), "</a>\n")), by = "jurisdiction.code"]
# we add the coordinate of the juridiction
display.center <- merge(display, world.states, all.x = T, by = "jurisdiction.code")
# get rid of non geocoded queries
display.center.nomiss <- display.center[!is.na(display) & !is.na(lat)] # map
map.leaflet <- leaflet() %>% addTiles() %>% addMarkers(display.center.nomiss$lng, display.center.nomiss$lat, display.center.nomiss$display,
options = popupOptions(closeButton = FALSE)) map.leaflet
Create a graph of companies transactions
One simple and excellent thing to do it is to look at company which had been bought and sold quite often. For this, we use the networkD3
package, which allows us to create a graph of the links between companies. We currently only look at interactions of more than 3 nodes, aka, at least two changes of name. Otherwise, the number of nodes is too big.
# creation of the data frame with target and links data
links <- merge(prev.dt[!is.na(company.name)], oc.dt
, by = c("jurisdiction.code", "company.number")
, all.x = T)[!(is.na(name) | is.na(company.name) | company.name == name)][
, list(source = name, target = company.name)]
# index of multiple sales companies + cretion of a short links table
multiple.name <- data.table(table(c(links$source, links$target)))[N>2, as.character(V1)]
links.short <- links[source %in% multiple.name | target %in% multiple.name] # Plot
simpleNetwork(networkData <- data.frame(links.short$source
, links.short$target), fontSize = 10)
The only minor point with this method is that we don’t know from the graph which one bought the other. But we could see the big clusters of companies. For exemple, for EDF, we could see that EDF energy Group Holding PLC is very active when it comes to modification of names.
Leave a Comment