I was working on a temporal knowledge graph (tKG) when the datasets and their usage confused me. After reading multiple documents and sorting through a lot of information, I decided to write this article explaining them in plain words for anyone who’s new to the field.
Let’s start with GDELT, a “global database of events, language, and tone”.
Coverage of GDELT
The GDELT project covers so much data across such a wide range of topics that it can be frustrating at first glance. In a nutshell, it covers:
- Streams of geo-social and geo-political events in temporal sequence;
- Event attributes like country/region, tone, appearance in the media, potential event impact, etc;
- Relations between events.
Among them, the first two are contained in one dataset called the Event Database, and the third is contained in one called the GKG (Global Knowledge Graph).
There are some other sub-datasets for specific uses, e.g. an academic paper citation graph, but they will be left unexplained in this article. (I will add more if someday I have to use them for my projects.)
The GDELT project has two versions. V1.0 covers events from 1979 onward and is updated daily, while V2.0 started in early 2015 and is updated every 15 minutes. The V2.0 data now also contains all events from V1.0, converted to the V2.0 notations and attributes. In short, if you are building your tKG from the events in the dataset now, go directly for V2.0.
Where to retrieve the datasets?
All data are hosted on Google BigQuery. You need a Google Cloud Platform (GCP) account to preview, query, and download the data as described below.
The direct dataset link is: https://bigquery.cloud.google.com/table/gdelt-bq:gdeltv2.events.
Add the linked dataset to your own cloud console, and you can then play around with it. Google BigQuery follows SQL syntax, which is intuitive enough to get started with even if you know nothing about SQL. With the GUI and a few customised SQL commands, you should be able to filter the data and download the result with a few clicks and minimal typing. (I’m not going to add any resources for SQL here though. For me, googling the commands for a specific need is the fastest way to learn.)
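To give a feel for what such a query might look like, here is a minimal sketch that only builds the SQL text in Python; you would paste the printed query into the BigQuery console yourself. The helper name build_events_query is made up for this article. The table path and the column names (SQLDATE, Actor1Code, Actor2Code, EventCode, GoldsteinScale, AvgTone) are my reading of the public table schema, and I am assuming SQLDATE is stored as an integer in YYYYMMDD form, so please check them against the schema shown in your console.

```python
def build_events_query(start_day: int, end_day: int, limit: int = 1000) -> str:
    """Build a BigQuery standard SQL string for events between two days.

    start_day and end_day are integers in YYYYMMDD form (an assumption
    about how SQLDATE is stored; verify against the table schema).
    """
    for value in (start_day, end_day, limit):
        if not isinstance(value, int) or value < 0:
            raise ValueError("expected non-negative integers")
    return (
        "SELECT SQLDATE, Actor1Code, Actor2Code, EventCode, GoldsteinScale, AvgTone\n"
        "FROM `gdelt-bq.gdeltv2.events`\n"
        f"WHERE SQLDATE BETWEEN {start_day} AND {end_day}\n"
        "  AND Actor1Code IS NOT NULL AND Actor2Code IS NOT NULL\n"
        f"LIMIT {limit}"
    )


# Example: up to 100 events from January 2020 that have both actors filled in.
query = build_events_query(20200101, 20200131, limit=100)
print(query)
```

Keeping only the events where both actors are present is a choice I made for tKG use, since an event without a subject or an object cannot become a graph edge.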
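The data comes in .csv files, but the true delimiter is a tab (“\t”). Each line is ready to be processed in Python with some_str.rstrip("\n").split("\t"). (A bare some_str.strip() does not split anything, and it would also remove leading or trailing tabs, and with them any empty first or last fields.)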
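As a small illustration of reading such a tab-delimited file, here is a sketch using Python’s built-in csv module. It assumes the file starts with a header row of column names; if your download has no header, you would need to supply the names yourself from the codebook. The function name read_tab_delimited and the tiny sample text are made up for this example and are not real GDELT records.

```python
import csv
import io


def read_tab_delimited(text):
    """Parse tab-delimited text whose first line is a header row.

    Returns a list of dicts, one per data line, keyed by column name.
    """
    reader = csv.DictReader(
        io.StringIO(text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    )
    return list(reader)


# Made-up sample with only four columns, for illustration.
sample = (
    "SQLDATE\tActor1Code\tActor2Code\tEventCode\n"
    "20200101\tUSA\tCHN\t042\n"
    "20200102\tGOV\t\t010\n"  # empty Actor2Code stays an empty string
)
rows = read_tab_delimited(sample)
print(rows[0]["Actor1Code"], rows[0]["EventCode"])
```

For a real file, you would pass an open file object to csv.DictReader instead of the io.StringIO wrapper. All values come back as strings, so codes with leading zeros such as 042 are preserved.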
The Event Database
This is what we used as our tKG. The Event Database contains one event per line in the .csv, and each event has three parts: actor1, actor2, and the action from actor1 to actor2. In tKG terms, these correspond to subject, object, and predicate/relation. Additional information includes the time of the event, the event category, the potential event impact (Goldstein score), the categories of the actors and actions, the tone/sentiment of the action, etc.
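The mapping from an event to a tKG fact can be sketched as below. This is one possible way to do it, not necessarily how our project did: it takes an event as a dict and returns a (subject, relation, object, timestamp) quadruple. The names Quadruple and event_to_quadruple are made up for this example, and the column names Actor1Code, Actor2Code, EventCode, and SQLDATE (a date written as YYYYMMDD) are my assumptions about the exported columns, so check them against the codebook.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class Quadruple(NamedTuple):
    subject: str    # actor1
    relation: str   # action code from actor1 to actor2
    obj: str        # actor2
    timestamp: str  # ISO date, e.g. "2020-01-01"


def event_to_quadruple(event: dict) -> Optional[Quadruple]:
    """Turn one event record into a tKG quadruple.

    Returns None when an actor or the action code is missing,
    because such an event cannot form a complete graph edge.
    """
    subject = event.get("Actor1Code") or ""
    obj = event.get("Actor2Code") or ""
    relation = event.get("EventCode") or ""
    if not (subject and obj and relation):
        return None
    # Assumed date format: YYYYMMDD, given as a string or an integer.
    day = datetime.strptime(str(event["SQLDATE"]), "%Y%m%d").date()
    return Quadruple(subject, relation, obj, day.isoformat())


# Made-up event, for illustration only.
example = {"SQLDATE": "20200101", "Actor1Code": "USA",
           "Actor2Code": "CHN", "EventCode": "042"}
fact = event_to_quadruple(example)
print(fact)
```

Extra attributes such as the Goldstein score or tone could be kept alongside each quadruple as edge features if your model uses them.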
The link for the codebook is here: http://data.gdeltproject.org/documentation/GDELT-Event_Codebook-V2.0.pdf, where attribute codes and meanings are explained.
Most codes follow CAMEO (Conflict and Mediation Event Observations). A full collection of the code meanings is here: https://www.gdeltproject.org/data/documentation/CAMEO.Manual.1.1b3.pdf. Its parent directory contains more related documentation and SQL briefings.
There are other documents related to GDELT V2.0 here: https://blog.gdeltproject.org/gdelt-2-0-our-global-world-in-realtime/.
The blog also contains other progress updates, news, and usage examples of GDELT. (I find the information there messy though 🤫, which is also why I wrote this to clarify things for anyone in need, incl. myself.)
Something not related to GDELT
This is the first blog post I’ve ever written! I’m not really expecting anyone to see it, but if you do, please don’t hesitate to leave comments and/or applause. I have been trying for years to present jargon and complex information in a beginner-friendly way, and I hope my efforts worked at least a little bit. Thank you for reading, for being among my first batch of readers, and for your support!