In this tutorial, we will show how to use the tidytext package to convert the Manifesto Corpus into a tidy text format. We assume that you have already read the "First steps with manifestoR" tutorial.
Tidy data and tidytext
The tidy text format is inspired by the tidy data format (Wickham 2014). Data is tidy if
- each variable is a column
- each observation is a row
- each type of observational unit is a table
In other contexts, tidy data is also known as the "long" format.
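To illustrate the idea (this small example is made up for illustration and is not part of the Manifesto Corpus workflow), a "wide" table with one column per election year can be converted into the tidy/"long" format with tidyr:

library(tidyr)
library(tibble)

# hypothetical wide table: one seat-count column per election year
seats_wide <- tibble(party = c("A", "B"), `2012` = c(10, 20), `2016` = c(15, 25))

# tidy/"long" form: one row per party-year observation
pivot_longer(seats_wide, cols = c(`2012`, `2016`), names_to = "year", values_to = "seats")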
The tidy text format picks up these principles of tidy data. Tidy text is a format where information is stored in "a table with one-token-per-row" (Silge and Robinson 2016). This is in contrast to the term-document matrices or document-feature matrices that are commonly used in text analysis.
The advantage of the tidy text format is that it allows you to use functions that many users already know from managing and cleaning "normal" data sets.
The tidytext package provides functions to transform several other text data formats into a tidy text format. These functions can also be applied to the Manifesto Corpus format. In the following, we will show how to use the functions of the tidytext package to convert the Manifesto Corpus into a tidy text format.
tidytext package
If you have not installed the manifestoR or tidytext packages yet, you need to install them first with install.packages("manifestoR") and/or install.packages("tidytext"). As in every session using the Manifesto Corpus, you need to set your API key. To learn more about the API key and manifestoR, see the tutorial "First steps with manifestoR". Moreover, we fix the corpus version using the mp_use_corpus_version function. This ensures that the script does not break when a new corpus version is published, as by default the latest corpus version is used.
library(manifestoR)
library(tidytext)
library(dplyr)
library(ggplot2)
mp_setapikey(key.file = "manifesto_apikey.txt")
mp_use_corpus_version("2017-2")
The mp_corpus function returns a ManifestoCorpus object in the Corpus format of the tm package (see the "First steps…" tutorial for more information). We use the manifestos from the Irish 2016 election as our exemplary case here.
ireland2016_corpus <- mp_corpus(countryname == "Ireland" & date == 201602)
ireland2016_corpus
## <<ManifestoCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 10
The tidy()
function transforms the ManifestoCorpus into a data frame where each row represents one document. Variables are the meta-information from the corpus as well as an additional variable named “text” that contains the whole text for each document.
tidied_corpus <- ireland2016_corpus %>% tidy()
tidied_corpus
## # A tibble: 10 x 17
## manifesto_id party date language source has_eu_code is_primary_doc
## <chr> <dbl> <dbl> <chr> <chr> <lgl> <lgl>
## 1 53110_201602 53110 201602 english MARPOR FALSE TRUE
## 2 53231_201602 53231 201602 english MARPOR FALSE TRUE
## 3 53240_201602 53240 201602 english MARPOR FALSE TRUE
## 4 53250_201602 53250 201602 english MARPOR FALSE TRUE
## 5 53320_201602 53320 201602 english MARPOR FALSE TRUE
## 6 53321_201602 53321 201602 english MARPOR FALSE TRUE
## 7 53520_201602 53520 201602 english MARPOR FALSE TRUE
## 8 53620_201602 53620 201602 english MARPOR FALSE TRUE
## 9 53951_201602 53951 201602 english MARPOR FALSE TRUE
## 10 53981_201602 53981 201602 english MARPOR FALSE TRUE
## # … with 10 more variables: may_contradict_core_dataset <lgl>,
## # md5sum_text <chr>, url_original <chr>, md5sum_original <chr>,
## # annotations <lgl>, handbook <chr>, is_copy_of <chr>, title <chr>, id <chr>,
## # text <chr>
The most important function of the tidytext package is the unnest_tokens function. It tokenizes the text variable into words (or other tokens) and creates one row per token, making the data frame tidy. By default, the unnest_tokens function transforms all characters to lower case.
tidy_df <- tidied_corpus %>%
  unnest_tokens(word, text)
tidy_df %>%
  select(manifesto_id, word) %>%
  head(15)
## # A tibble: 15 x 2
## manifesto_id word
## <chr> <chr>
## 1 53110_201602 think
## 2 53110_201602 ahead
## 3 53110_201602 act
## 4 53110_201602 now
## 5 53110_201602 general
## 6 53110_201602 election
## 7 53110_201602 manifesto
## 8 53110_201602 2016
## 9 53110_201602 progressive
## 10 53110_201602 practical
## 11 53110_201602 and
## 12 53110_201602 sustainable
## 13 53110_201602 politics
## 14 53110_201602 for
## 15 53110_201602 the
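Single words are not the only possible tokens: unnest_tokens can also split the text into other units such as n-grams. As a quick illustration (this variant is our addition and not part of the original analysis), the following extracts bigrams instead of single words:

tidied_corpus %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%  # two-word sequences
  select(manifesto_id, bigram) %>%
  head(5)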
Cleaning and preprocessing
The tidy format makes it possible to use the dplyr grammar to preprocess and clean the data. To delete stopwords, we make use of a stop word collection that comes with the tidytext package: the get_stopwords() function returns a data frame with a list of stopwords (frequent but rarely meaningful words).
get_stopwords()
## # A tibble: 175 x 2
## word lexicon
## <chr> <chr>
## 1 i snowball
## 2 me snowball
## 3 my snowball
## 4 myself snowball
## 5 we snowball
## 6 our snowball
## 7 ours snowball
## 8 ourselves snowball
## 9 you snowball
## 10 your snowball
## # … with 165 more rows
anti_join here keeps only those words that do not appear in the data frame provided as its argument, thereby removing all stopwords. Another advantage of the tidy text format is that one can easily filter for certain characteristics. Here, we show how to filter out tokens that consist of numbers only. The expression is.na(as.numeric(word)) keeps words that cannot be transformed into numeric values and thereby drops all tokens that contain nothing but numbers (such as the "2016" in the example above).
tidy_without_stopwords <- tidy_df %>%
  anti_join(get_stopwords()) %>%
  filter(is.na(as.numeric(word)))
tidy_without_stopwords %>%
  select(manifesto_id, word) %>%
  head(10)
## # A tibble: 10 x 2
## manifesto_id word
## <chr> <chr>
## 1 53110_201602 think
## 2 53110_201602 ahead
## 3 53110_201602 act
## 4 53110_201602 now
## 5 53110_201602 general
## 6 53110_201602 election
## 7 53110_201602 manifesto
## 8 53110_201602 progressive
## 9 53110_201602 practical
## 10 53110_201602 sustainable
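A side note: is.na(as.numeric(word)) produces "NAs introduced by coercion" warnings, because the conversion is attempted on every token. If you prefer to avoid these warnings, a regular expression achieves much the same result. This variant is our suggestion, not part of the original tutorial:

tidy_df %>%
  anti_join(get_stopwords(), by = "word") %>%
  filter(!grepl("^[0-9]+$", word))  # drop tokens consisting of digits only
                                    # (unlike as.numeric(), this keeps decimals such as "3.5")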
Term frequencies and tf-idf
Using the count
function on the tidied data, it is very easy to obtain term frequencies of the corpus under investigation.
tidy_without_stopwords %>%
  count(word, sort = TRUE) %>%
  head(10)
## # A tibble: 10 x 2
## word n
## <chr> <int>
## 1 new 847
## 2 people 704
## 3 ireland 686
## 4 public 678
## 5 ensure 674
## 6 government 620
## 7 support 615
## 8 fine 578
## 9 gael 556
## 10 services 536
General term frequencies (even when calculated per document) are often not very meaningful, as they do not differ much across documents. Many applications therefore calculate the tf-idf score (term frequency times inverse document frequency). It detects words that appear often within one document but rarely in other documents, thereby identifying words that are on the one hand frequent, but on the other hand also distinct. tidytext provides the function bind_tf_idf, which adds the tf-idf score to a data frame containing term frequencies and document meta data.
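To make transparent what bind_tf_idf computes, here is a rough sketch of the calculation in plain dplyr (our illustration; bind_tf_idf handles all of this internally, including using the natural logarithm for the idf):

word_counts <- tidy_without_stopwords %>%
  count(manifesto_id, word)

n_docs <- n_distinct(word_counts$manifesto_id)

word_counts %>%
  group_by(manifesto_id) %>%
  mutate(tf = n / sum(n)) %>%          # share of the document's tokens
  group_by(word) %>%
  mutate(idf = log(n_docs / n())) %>%  # ln(#documents / #documents containing the word)
  ungroup() %>%
  mutate(tf_idf = tf * idf)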
Before calculating the tf-idf scores, we get nicer document names based on the party names stored in the Manifesto Project Dataset.
irish_partynames <- mp_maindataset() %>%
  filter(countryname == "Ireland" & date == 201602) %>%
  select(party, partyname)
The following shows how to calculate tf-idf scores and plot the 5 highest scoring terms for each manifesto. For more information on tf-idf scores, have a look at the respective chapter of the Text Mining with R book.
tidy_without_stopwords %>%
  count(party, word, sort = TRUE) %>%
  bind_tf_idf(word, party, n = n) %>%
  arrange(party, desc(tf_idf)) %>%
  group_by(party) %>%
  top_n(5) %>%
  ungroup() %>%
  left_join(irish_partynames, by = "party") %>%
  ggplot(aes(x = reorder(word, tf_idf), y = tf_idf, fill = partyname)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~partyname, ncol = 2, scales = "free") +
  coord_flip()
## Selecting by tf_idf
One can see that the terms with high tf-idf scores differ across parties. Not surprisingly, the parties’ names or parts thereof appear often in these lists (as they are often used by the party, and rarely by other parties).
Make use of the codings (annotations)
The previous analyses made use only of the machine-readable texts and did not exploit the digital codings/annotations of the Manifesto Corpus. In this section, we will show how to use the tidytext package in conjunction with these annotations. In order to keep the codes for further analysis, it is necessary to first convert the ManifestoCorpus object to a data.frame and then use the unnest_tokens function (instead of the tidy function, which would drop the codes). The pos variable in the returned data frame comes from the content object of the Manifesto Corpus and indicates the position of a quasi-sentence within a ManifestoDocument. The following extract shows quasi-sentences 50 and 51 of the Green Party manifesto (party id 53110). For better readability, we did not remove stopwords here. One can see that quasi-sentence 50 was coded as 107 (internationalism: positive), while the following quasi-sentence was coded as 501 (environmental protection: positive).
words_and_codes <- mp_corpus(countryname == "Ireland" & date == 201602) %>%
  as.data.frame(with.meta = TRUE) %>%
  unnest_tokens(word, text)
words_and_codes %>%
  select(party, word, pos, cmp_code) %>%
  filter(party == 53110 & between(pos, 50, 51))
## party word pos cmp_code
## 50 53110 we 50 107
## 50.1 53110 need 50 107
## 50.2 53110 to 50 107
## 50.3 53110 regain 50 107
## 50.4 53110 this 50 107
## 50.5 53110 spirit 50 107
## 50.6 53110 and 50 107
## 50.7 53110 this 50 107
## 50.8 53110 stance 50 107
## 50.9 53110 and 50 107
## 50.10 53110 act 50 107
## 50.11 53110 as 50 107
## 50.12 53110 an 50 107
## 50.13 53110 honest 50 107
## 50.14 53110 broker 50 107
## 50.15 53110 in 50 107
## 50.16 53110 all 50 107
## 50.17 53110 our 50 107
## 50.18 53110 multilateral 50 107
## 50.19 53110 engagements 50 107
## 51 53110 looking 51 501
## 51.1 53110 globally 51 501
## 51.2 53110 we 51 501
## 51.3 53110 will 51 501
## 51.4 53110 legislate 51 501
## 51.5 53110 for 51 501
## 51.6 53110 binding 51 501
## 51.7 53110 targets 51 501
## 51.8 53110 on 51 501
## 51.9 53110 climate 51 501
## 51.10 53110 change 51 501
## 51.11 53110 in 51 501
## 51.12 53110 line 51 501
## 51.13 53110 with 51 501
## 51.14 53110 the 51 501
## 51.15 53110 paris 51 501
## 51.16 53110 agreement 51 501
Now, we can simply filter based on the cmp_code, e.g. to exclude some of the word occurrences from the analysis. One can also use the coding information to calculate tf-idf scores based on the different coding categories instead of the different documents. This should give us terms that are distinct and meaningful for the given categories. We first remove stopwords and purely numeric tokens from the word list as shown above and drop quasi-sentences coded as headlines (H), non-coded quasi-sentences, and quasi-sentences coded as "0" (no particular meaning, cannot be coded). To reduce complexity, we recode categories coded according to version 5 of the coding instructions to the less complex coding scheme of version 4 (this aggregates several subcategories into their main categories; see the subcategories tutorial for more information). Then, we count and calculate tf-idf scores based on the word frequencies per coding category (instead of per document).
tfidf_codes <- words_and_codes %>%
  anti_join(get_stopwords()) %>%
  filter(is.na(as.numeric(word))) %>%
  filter(!(cmp_code %in% c("H", "", "0", "000", NA))) %>%
  mutate(cmp_code = recode_v5_to_v4(cmp_code)) %>%
  count(cmp_code, word) %>%
  bind_tf_idf(word, cmp_code, n)
For illustrative purposes, we restrict the data set to four codes: decentralisation (301), technology & infrastructure (411), environmental protection (501), and culture (502). We can see that the terms with high tf-idf scores seem very reasonable and make intuitive sense for these categories (admittedly, otherwise we would not have chosen this example…).
tfidf_codes %>%
  filter(cmp_code %in% c("501", "502", "301", "411")) %>%
  mutate(cmp_code = factor(cmp_code,
                           labels = c("Decentralisation", "Technology & Infrastructure",
                                      "Environmental Protection", "Culture"))) %>%
  group_by(cmp_code) %>%
  top_n(10, tf_idf) %>%
  ggplot(aes(x = reorder(word, tf_idf), y = tf_idf, fill = cmp_code)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "tf-idf") +
  facet_wrap(~cmp_code, ncol = 2, scales = "free") +
  coord_flip()
tidytext also provides many functions to convert to and from other text packages such as quanteda or tm.
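For example, the tidy word counts per coding category from above can be cast into a document-term matrix of the tm package (a minimal sketch; casting to a quanteda dfm works analogously with cast_dfm):

tfidf_codes %>%
  cast_dtm(cmp_code, word, n)  # one "document" per coding category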
This tutorial was just a primer on how to use the tidytext package (and philosophy) with the Manifesto Corpus. If you want to dig deeper into tidy text mining, we recommend the book "Text Mining with R: A Tidy Approach" by Julia Silge and David Robinson.
Bibliography
Wickham, Hadley. 2014. "Tidy Data." Journal of Statistical Software 59 (10). doi:10.18637/jss.v059.i10.
Silge, Julia, and David Robinson. 2016. “Tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” The Journal of Open Source Software 1 (3). doi:10.21105/joss.00037.
Session Info
Tested with:
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.0.3 (2020-10-10)
## date 2021-06-15
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date lib source
## assertthat 0.2.0 2017-04-11 [NA] CRAN (R 4.0.3)
## base64enc 0.1-3 2015-07-28 [NA] CRAN (R 4.0.2)
## bookdown 0.22 2021-04-22 [NA] CRAN (R 4.0.2)
## cli 1.1.0 2019-03-19 [NA] CRAN (R 4.0.3)
## colorspace 1.3-2 2016-12-14 [NA] CRAN (R 4.0.3)
## crayon 1.3.4 2017-09-16 [NA] CRAN (R 4.0.2)
## curl 3.2 2018-03-28 [NA] CRAN (R 4.0.3)
## digest 0.6.21 2019-09-20 [NA] CRAN (R 4.0.3)
## dplyr * 1.0.6 2021-05-05 [NA] CRAN (R 4.0.2)
## DT 0.7 2019-06-11 [NA] CRAN (R 4.0.3)
## ellipsis 0.3.2 2021-04-29 [NA] CRAN (R 4.0.3)
## evaluate 0.14 2019-05-28 [NA] CRAN (R 4.0.1)
## fansi 0.4.0 2018-10-05 [NA] CRAN (R 4.0.3)
## farver 2.0.1 2019-11-13 [NA] CRAN (R 4.0.3)
## foreign 0.8-70 2018-04-23 [NA] CRAN (R 4.0.3)
## functional 0.6 2014-07-16 [NA] CRAN (R 4.0.2)
## generics 0.0.2 2018-11-29 [NA] CRAN (R 4.0.2)
## ggplot2 * 3.3.3 2020-12-30 [NA] CRAN (R 4.0.2)
## glue 1.4.2 2020-08-27 [NA] CRAN (R 4.0.2)
## gtable 0.2.0 2016-02-26 [NA] CRAN (R 4.0.3)
## highr 0.6 2016-05-09 [NA] CRAN (R 4.0.3)
## hms 0.4.2 2018-03-10 [NA] CRAN (R 4.0.3)
## htmltools 0.4.0 2019-10-04 [NA] CRAN (R 4.0.3)
## htmlwidgets 1.5.3 2020-12-10 [NA] CRAN (R 4.0.2)
## httr 1.3.1 2017-08-20 [NA] CRAN (R 4.0.3)
## janeaustenr 0.1.1 2016-06-20 [NA] CRAN (R 4.0.3)
## jsonlite 1.6 2018-12-07 [NA] CRAN (R 4.0.3)
## knitr 1.33 2021-04-24 [NA] CRAN (R 4.0.2)
## labeling 0.3 2014-08-23 [NA] CRAN (R 4.0.3)
## lattice 0.20-35 2017-03-25 [NA] CRAN (R 4.0.3)
## lifecycle 1.0.0 2021-02-15 [NA] CRAN (R 4.0.2)
## magrittr 2.0.1 2020-11-17 [NA] CRAN (R 4.0.2)
## manifestoR * 1.5.0 2020-11-29 [NA] CRAN (R 4.0.2)
## Matrix 1.2-14 2018-04-09 [NA] CRAN (R 4.0.3)
## mnormt 1.5-5 2016-10-15 [NA] CRAN (R 4.0.3)
## munsell 0.5.0 2018-06-12 [NA] CRAN (R 4.0.2)
## nlme 3.1-131 2017-02-06 [NA] CRAN (R 4.0.3)
## NLP * 0.1-9 2016-02-18 [NA] CRAN (R 4.0.3)
## pillar 1.6.1 2021-05-16 [NA] CRAN (R 4.0.2)
## pkgconfig 2.0.2 2018-08-16 [NA] CRAN (R 4.0.3)
## psych 1.8.3.3 2018-03-30 [NA] CRAN (R 4.0.3)
## purrr 0.3.2 2019-03-15 [NA] CRAN (R 4.0.3)
## R6 2.2.2 2017-06-17 [NA] CRAN (R 4.0.3)
## Rcpp 1.0.0 2018-11-07 [NA] CRAN (R 4.0.3)
## readr 1.3.1 2018-12-21 [NA] CRAN (R 4.0.3)
## rlang 0.4.10 2020-12-30 [NA] CRAN (R 4.0.2)
## rmarkdown 2.8 2021-05-07 [NA] CRAN (R 4.0.2)
## rmdformats 1.0.2 2021-04-19 [NA] CRAN (R 4.0.2)
## scales 1.1.0 2019-11-18 [NA] CRAN (R 4.0.3)
## sessioninfo 1.1.1 2018-11-05 [NA] CRAN (R 4.0.2)
## slam 0.1-40 2016-12-01 [NA] CRAN (R 4.0.3)
## SnowballC 0.5.1 2014-08-09 [NA] CRAN (R 4.0.3)
## stopwords 0.9.0 2017-12-14 [NA] CRAN (R 4.0.3)
## stringi 1.1.7 2018-03-12 [NA] CRAN (R 4.0.3)
## stringr 1.3.0 2018-02-19 [NA] CRAN (R 4.0.3)
## tibble 3.1.2 2021-05-16 [NA] CRAN (R 4.0.2)
## tidyselect 1.1.1 2021-04-30 [NA] CRAN (R 4.0.3)
## tidytext * 0.2.1 2019-06-14 [NA] CRAN (R 4.0.3)
## tm * 0.7-5 2018-07-29 [NA] CRAN (R 4.0.3)
## tokenizers 0.2.1 2018-03-29 [NA] CRAN (R 4.0.2)
## utf8 1.1.3 2018-01-03 [NA] CRAN (R 4.0.3)
## vctrs 0.3.8 2021-04-29 [NA] CRAN (R 4.0.3)
## withr 2.1.2 2018-03-15 [NA] CRAN (R 4.0.3)
## xfun 0.23 2021-05-15 [NA] CRAN (R 4.0.2)
## xml2 1.2.0 2018-01-24 [NA] CRAN (R 4.0.3)
## yaml 2.2.0 2018-07-25 [NA] CRAN (R 4.0.3)
## zoo 1.7-13 2016-05-03 [NA] CRAN (R 4.0.3)