Ropencorporate: A Package to Access Opencorporate API

11 minute read

A couple of month ago, I went to the London open data mission meetup in Shoreditch. It is a meetup organized by the company Open corporates, a company which gather officials documents on company creation and dissolution and make these documents available through an API.

The problem I had, back there, was that everybody was working with python. And even if I am ok to work with python, my productivity is not the same. So to make myself “at home”, I have created the R package Ropencorporate to query the API.

Initialisation

First thing first, let’s set up the session, query the data and do some cleaning.

# download and install the package library(devtools) 
devtools::install_github("YvesCr/Ropencorporate")
# Attach the libraries library(Ropencorporate, quietly = T) library(data.table) 
library(stringr) library(DT) library(leaflet) library(dygraphs) 
library(lubridate, quietly = T, warn.conflicts = F) 
library(networkD3, quietly = T, warn.conflicts = F)

We choose the term “edf”, which is short for “Electricité de France”, a company I appreciate and would not want to see having issue due to corruption and bad decisions.

term <- "edf"

To query the database, we use the function get.companies. Without credential, we are limited to the 20 first pages of result. For people with a token, an option token is implemented.

res.oc <- Ropencorporate::get.companies(term, nb.page = 20) 
oc.dt <- res.oc$oc.dt prev.dt <- res.oc$prev.dt

The result we get is a bit messy: The same company could appear under multiple names due to multiple format used. This is a complex task which require human assistance. Nevertheless, we could use a couple of transformation to clean the name.

For a complete cleaning, a good option is to use open refine to do this work. But it is not the point here, so we just use the fingerprint method to simplify the name. For that, we use the fingerprint.func function plus a bit of work to keep a non formatted version of the name.

## What I want: the most frequent occurence of the name  
# First, we use the fingerprint method to normalise the name 
oc.dt[, name2 := unlist(lapply(name, fingerprint.func))]  
# Taking the fingerprint reduce the number of company names by 10 %  
# Now, replace name2 by the top name appearing in term of number, then by the first appearing. 
oc.dt[, freq.name := .N, by = "name"] 
oc.dt[, freq.name2 := max(freq.name), by = "name2"]  # for a subset, select the top name 
oc.dt[, Name := name[max(which(freq.name2 == freq.name))], by = "name2"]  
#delete unused rows: 
oc.dt[, freq.name := NULL] ; oc.dt[, freq.name2 := NULL]; oc.dt[, name2 := NULL]

We tighten the date variables.

oc.dt[, Creation.date := as.Date(created.at)] 
oc.dt[, Update.date := as.Date(updated.at)] oc.dt[, Retrieved.date := as.Date(retrieved.at)] 
oc.dt[, Dissolution.date := as.Date(dissolution.date)]

Basic overview

Now, let’s have a look at the result of the query: Number of different companies which have the term, places where the term is the more common, etc..

Occurences of the name in the query

Which are the entries with the term edf appearing the most?

num.oc <- dim(oc.dt)[1] 
datatable(oc.dt[, list(Frequency = .N, Percentage = sprintf("%.1f%%", .N / num.oc * 100)), by = "Name"][order(Frequency, decreasing = T)], options = list(pageLength = 10))

Occurences of the juridictions

Which are the juridictions with the most entries?

datatable(oc.dt[, list(Frequency = .N, Percentage = sprintf("%.1f%%", .N / num.oc * 100)), by = "jurisdiction.code"][order(Frequency, decreasing = T)], options = list(pageLength = 10))

Most recent apparition

Which are the more recent companies?

datatable(oc.dt[, list(Frequency = .N, Percentage = sprintf("%.1f%%", .N / num.oc * 100)), by = c("Creation.date", "Name")][order(Creation.date, decreasing = T)], options = list(pageLength = 10))

oc.dt[, Creation.date.m := as.Date(format(Creation.date, "01 %m %Y"), "%d %m %Y")] 
oc.dt[, Dissolution.date.m := as.Date(format(Dissolution.date, "01 %m %Y"), "%d %m %Y")] 
 
oc.dt.date0 <- oc.dt[, .N, by = c("Name", "Creation.date.m")][, list(Creation.date = .N), by = c("Creation.date.m")]

Dissolution dates are not of good quality: We keep only the ones inside the boundaries between the first creation and today.

oc.dt.date1 <- oc.dt[!is.na(Dissolution.date.m) & Dissolution.date.m < Sys.Date() & Dissolution.date.m >=  min(oc.dt.date0[, Creation.date.m]), .N, by = c("Name", "Dissolution.date.m")][, list(Dissolution.date = .N), by = c("Dissolution.date.m")] 
 oc.dt.date <- merge(oc.dt.date0, oc.dt.date1, by.x = "Creation.date.m" 
                    , by.y = "Dissolution.date.m", all = T)  
oc.dt.date[is.na(Creation.date), Creation.date := 0] 
oc.dt.date[is.na(Dissolution.date), Dissolution.date := 0]   
dygraph(oc.dt.date, main = "Creation & dissolution of companies")

There is a lot of creation of companies, but only a few dissolution. Some companies are bought and not dissolved.

Current status

Which are the most frequent status?

datatable(oc.dt[, list(Frequency = .N, Percentage = sprintf("%.1f%%", .N / num.oc * 100)), by = "current.status"][order(Frequency, decreasing = T)], options = list(pageLength = 10))

Inactive

Number of inactive companies?

datatable(oc.dt[, list(Frequency = .N, Percentage = sprintf("%.1f%%", .N / num.oc * 100)), by = "inactive"][order(Frequency, decreasing = T)], options = list(pageLength = 10))

Branch status

datatable(oc.dt[, list(Frequency = .N, Percentage = sprintf("%.1f%%", .N / num.oc * 100)), by = c("branch.status")][order(Frequency, decreasing = T)], options = list(pageLength = 10))

Company type

datatable(oc.dt[, list(Frequency = .N, Percentage = sprintf("%.1f%%", .N / num.oc * 100)), by = "company.type"][order(Frequency, decreasing = T)], options = list(pageLength = 10))

Create an interactive map

To create a map, we use the dataset world.states include in the package. This dataset contain the coordinates of all the jurisdiction available in the opencorporate API.

# For the popup version, we create a label. 
display <- oc.dt[, list(frequency = .N, display = paste(.N, " links here:\n <a href='", paste0(opencorporates.url, "'>", Name, collapse = "</a>\n"), "</a>\n")), by = "jurisdiction.code"] 
 # we add the coordinate of the juridiction 
display.center <- merge(display, world.states, all.x = T, by = "jurisdiction.code")  
# get rid of non geocoded queries 
display.center.nomiss <- display.center[!is.na(display) & !is.na(lat)]  # map 
map.leaflet <- leaflet() %>% addTiles() %>%  addMarkers(display.center.nomiss$lng, display.center.nomiss$lat, display.center.nomiss$display, 
    options = popupOptions(closeButton = FALSE)) map.leaflet

Create a graph of companies transactions

One simple and excellent thing to do it is to look at company which had been bought and sold quite often. For this, we use the networkD3 package, which allows us to create a graph of the links between companies. We currently only look at interactions of more than 3 nodes, aka, at least two changes of name. Otherwise, the number of nodes is too big.

# creation of the data frame with target and links data 
links <- merge(prev.dt[!is.na(company.name)], oc.dt 
    , by = c("jurisdiction.code", "company.number") 
    , all.x = T)[!(is.na(name) | is.na(company.name) |           company.name == name)][ 
    , list(source = name, target = company.name)]  
# index of multiple sales companies + cretion of a short links table 
multiple.name <- data.table(table(c(links$source, links$target)))[N>2, as.character(V1)] 
links.short <- links[source %in% multiple.name | target %in% multiple.name]  # Plot 
simpleNetwork(networkData <- data.frame(links.short$source 
    , links.short$target), fontSize = 10)

The only minor point with this method is that we don’t know from the graph which one bought the other. But we could see the big clusters of companies. For exemple, for EDF, we could see that EDF energy Group Holding PLC is very active when it comes to modification of names.

Leave a Comment