Using the Manifesto Corpus with quanteda

Nicolas Merz

19 June 2018 (slightly updated 14 June 2021 for quanteda 3.0 compatibility)

In this tutorial, we show how to use the quanteda package to analyze the Manifesto Corpus. We assume that you have already read First steps with manifestoR (at least up to “Downloading documents from the Manifesto Corpus”) and that you are familiar with the pipe operator %>%. The tutorial was written in 2018 based on the quanteda version available at that time; it was slightly adapted in 2021 to be compatible with quanteda version >= 3.0 (which also means that it may no longer be compatible with quanteda versions < 3.0).

Grammar and logic of quanteda

Quanteda (Benoit et al. 2018) is a comprehensive and powerful text analysis package for R. It is well documented, fast, and versatile.
Quanteda is built around three main objects:

  • Corpora (created with corpus()). These contain the texts (accessible via as.character()) and document-level meta information in the form of docvars().
  • Tokens objects (created with tokens()). The tokens() function tokenizes corpora into tokens, which can be of different kinds, such as words, characters, sentences, or n-grams.
  • Document-feature matrices (created with dfm()). A dfm is a matrix in which each row represents one document, each column represents a feature (mostly tokens, e.g. words or n-grams), and each cell records how often a feature occurs in a document. The dfm is the starting point for most types of analyses that draw inferences from the frequency of tokens.

Most quanteda functions take one of these three objects as input and transform it in some way. The functions are consistently and intuitively named, e.g. dfm_group() groups a document-feature matrix, tokens_remove() removes tokens from a tokens object, etc.
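
To illustrate this grammar, here is a minimal sketch of the corpus-tokens-dfm pipeline using data_corpus_inaugural, the corpus of US presidential inaugural addresses that ships with quanteda (no Manifesto Corpus data needed yet; the variable names are our own):

corp <- data_corpus_inaugural    ## a quanteda corpus with texts and docvars
toks <- tokens(corp)             ## tokenize into words
dfmat <- dfm(toks)               ## construct a document-feature matrix
dfm_group(dfmat, groups = Party) ## consistent naming: dfm_* functions transform dfms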

manifestoR and Quanteda

We first use the usual “header” of a manifestoR script: loading packages, setting the API key, and fixing the corpus version (to ensure reproducibility).

library(manifestoR)
library(quanteda)
library(quanteda.textstats)
library(dplyr)
library(ggplot2)
library(tidyr)
library(stringr)

mp_setapikey(key.file = "manifesto_apikey.txt")
mp_use_corpus_version("2017-2")

Before working with the Manifesto Corpus in quanteda, it is important to think about the level of analysis (the level of aggregation). Many operations in quanteda are meant to happen at the document level. For example, documents carry meta information, while smaller units within a document cannot have separate meta information. Different documents, however, can share the same meta information (for example the same party code or the same language). Depending on the research question, it might sometimes be more appropriate to treat whole manifestos as documents, and in other cases it might be better to treat individual quasi-sentences as documents.

Quanteda can import corpora from the manifestoR corpus format: a ManifestoCorpus object, as returned by the mp_corpus function (a kind of tm corpus, see the First steps with manifestoR tutorial), can be converted to a data.frame and passed to quanteda’s corpus function. We use mp_availability to check the availability of documents for the 2012 US elections, save the result, and use it as input for the mp_corpus function. Alternatively, we could have passed the same expression to mp_corpus that we used for mp_availability.

available_us2012 <- mp_availability(countryname == "United States" & date == 201211 & partyname %in% c("Democratic Party", "Republican Party"))
## Connecting to Manifesto Project DB API... corpus version: 2017-2
available_us2012
##           Queried for        Corpus Version       Documents found 
##                     2                2017-2              2 (100%) 
## Coded Documents found       Originals found             Languages 
##              2 (100%)              2 (100%)           1 (english)
tm_corpus <- mp_corpus(available_us2012)
## Connecting to Manifesto Project DB API... corpus version: 2017-2
tm_corpus
## <<ManifestoCorpus>>
## Metadata:  corpus specific: 0, document level (indexed): 0
## Content:  documents: 2

We queried for two documents, both of which are “Coded Documents”, i.e. documents with annotations. When converting this to a quanteda corpus, however, the result contains 3,188 documents, as every quasi-sentence is considered an individual document. As you can see in the code below, we transformed the ManifestoCorpus into a data.frame and called quanteda’s corpus function with the arguments docid_field = "manifesto_id" and unique_docnames = FALSE to let it auto-generate document names based on the manifesto_id and a within-document running number. Alternatively, you can generate the doc_id column manually by adding a mutate(doc_id = paste(manifesto_id, pos, sep = ".")) step before calling quanteda’s corpus() function without any further arguments (see the sketch after the output below).

quanteda_corpus <- tm_corpus %>%
  as.data.frame(with.meta = TRUE) %>%
  corpus(docid_field = "manifesto_id", unique_docnames = FALSE) ## quanteda's corpus function
quanteda_corpus
## Corpus consisting of 3,188 documents and 18 docvars.
## 61320_201211.1 :
## "Moving America Forward 2012 Democratic National Platform"
## 
## 61320_201211.2 :
## "Moving America Forward"
## 
## 61320_201211.3 :
## "Four years ago, Democrats, independents, and many Republican..."
## 
## 61320_201211.4 :
## "We were in the midst of the greatest economic crisis since t..."
## 
## 61320_201211.5 :
## "the previous administration had put two wars on our nation’s..."
## 
## 61320_201211.6 :
## "and the American Dream had slipped out of reach for too many..."
## 
## [ reached max_ndoc ... 3,182 more documents ]
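
The alternative mentioned above, as a minimal sketch (quanteda’s corpus() uses a doc_id column as document names by default, so no further arguments should be needed; the result should be equivalent):

quanteda_corpus_alt <- tm_corpus %>%
  as.data.frame(with.meta = TRUE) %>%
  mutate(doc_id = paste(manifesto_id, pos, sep = ".")) %>% ## e.g. "61320_201211.1"
  corpus()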

The meta information from the Manifesto Corpus is stored in the docvars and is available for each quasi-sentence.

quanteda_corpus %>%
  docvars() %>%
  names()
##  [1] "cmp_code"                    "eu_code"                    
##  [3] "pos"                         "party"                      
##  [5] "date"                        "language"                   
##  [7] "source"                      "has_eu_code"                
##  [9] "is_primary_doc"              "may_contradict_core_dataset"
## [11] "md5sum_text"                 "url_original"               
## [13] "md5sum_original"             "annotations"                
## [15] "handbook"                    "is_copy_of"                 
## [17] "title"                       "id"

When using manifestos that were coded with different versions of the coding instructions (see the tutorial on subcategories), it might be a good idea to first recode version 5 codes to version 4 using manifestoR’s recode_v5_to_v4 function before transforming the corpus into the quanteda corpus format.

mp_corpus(countryname == "Germany" & date == 201709) %>%
  as.data.frame(with.meta = TRUE) %>%
  corpus(docid_field = "manifesto_id", unique_docnames = FALSE) %>%
  docvars(field = "cmp_code") %>%
  head(10)
##  [1] "H"     "0"     "202.1" "201.1" "503"   "201.1" "503"   "201.1" "201.1"
## [10] "501"
mp_corpus(countryname == "Germany" & date == 201709) %>%
  recode_v5_to_v4() %>%
  as.data.frame(with.meta = TRUE) %>%
  corpus(docid_field = "manifesto_id", unique_docnames = FALSE) %>%
  docvars(field = "cmp_code") %>%
  head(10)
##  [1] "H"   "0"   "202" "201" "503" "201" "503" "201" "201" "501"

By default, digitally annotated quasi-sentences are treated as separate documents by quanteda (one document equals one quasi-sentence), while documents that have no annotations are treated as single documents (one document equals one manifesto). The following snippet illustrates this difference. Instead of querying for the 2012 documents that are annotated, we query the manifestos from the 2000 election, which are not annotated. The converted quanteda corpus then contains only two documents (each containing the whole text of one manifesto):

us_not_annotated <- mp_availability(countryname == "United States" & date %in% c(200011))
## Connecting to Manifesto Project DB API... corpus version: 2017-2
us_not_annotated
##           Queried for        Corpus Version       Documents found 
##                     2                2017-2              2 (100%) 
## Coded Documents found       Originals found             Languages 
##                0 (0%)              2 (100%)           1 (english)
mp_corpus(as.data.frame(us_not_annotated)) %>%
  as.data.frame(with.meta = TRUE) %>%
  corpus(docid_field = "manifesto_id", unique_docnames = FALSE)
## Connecting to Manifesto Project DB API... corpus version: 2017-2
## Corpus consisting of 2 documents and 18 docvars.
## 61320_200011 :
## "The 2000 Democratic National Platform: Prosperity, Progress,..."
## 
## 61620_200011 :
## "REPUBLICAN PLATFORM 2000 Renewing America's Purpose. Togethe..."

If you want to use a set of manifestos where one part of the set is available as annotated and the other part as non-annotated documents, it might be reasonable to first transform them to a similar aggregation level. One possibility would be to download the non-annotated manifestos separately, segment them into sentences using corpus_reshape(to = "sentences"), and then combine them with a corpus that is already parsed into quasi-sentences. If the level of analysis is the manifesto level anyway, one can also group the document-feature matrix later based on the manifesto id using dfm_group, as sketched below.
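
As a minimal sketch of the latter approach: our 2012 US corpus contains exactly one manifesto per party, so grouping the quasi-sentence dfm by the party docvar aggregates it back to the manifesto level (with several elections per party, one would construct and group by a proper manifesto id instead):

quanteda_corpus %>%
  tokens() %>%
  dfm() %>%
  dfm_group(party) ## yields a dfm with 2 documents: 61320 and 61620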

Subsetting the corpus

Quanteda makes it easy to subset a corpus based on document-level variables. The following code snippets subset the corpus based on the party code or based on the cmp_code.

quanteda_corpus %>%
  corpus_subset(party == 61620) %>%
  as.character() %>%
  head(5)
##                                                                             61620_201211.1 
##                                                                    "We Believe in America" 
##                                                                             61620_201211.2 
##                          "This platform is dedicated with appreciation and reverence for:" 
##                                                                             61620_201211.3 
##                                                                                 "Preamble" 
##                                                                             61620_201211.4 
## "The 2012 Republican Platform is a statement of who we are and what we believe as a Party" 
##                                                                             61620_201211.5 
##                                         "and our vision for a stronger and freer America."
quanteda_corpus %>%
  corpus_subset(cmp_code == 501) %>%
  as.character() %>%
  head(5)
##                                                                                                                        61320_201211.1 
##                                                                                         "and fuel-efficiency standards are doubling." 
##                                                                                                                        61320_201211.2 
##                    "Historic investments in clean energy technologies have helped double the electricity we get from wind and solar." 
##                                                                                                                        61320_201211.3 
##                                             "New emissions and fuel efficiency standards for American cars are reducing our oil use," 
##                                                                                                                        61320_201211.4 
## "which is why the Obama administration has proposed a number of safeguards to protect against water contamination and air pollution." 
##                                                                                                                        61320_201211.5 
##                                                                 "We will continue to advocate for the use of this clean fossil fuel,"

Tokenization with tokens()

The tokens() function in quanteda tokenizes the documents. Tokens can be words (the default), characters, sentences, or n-grams. The function provides many arguments that facilitate cleaning and preprocessing.

quanteda_corpus %>%
  tokens() %>%
  head(2)
## Tokens consisting of 2 documents and 18 docvars.
## 61320_201211.1 :
## [1] "Moving"     "America"    "Forward"    "2012"       "Democratic"
## [6] "National"   "Platform"  
## 
## 61320_201211.2 :
## [1] "Moving"  "America" "Forward"

One can also tokenize the same documents into bi-grams using the tokens_ngrams function. With n = 1:2, quanteda tokenizes into both uni-grams and bi-grams.

quanteda_corpus %>%
  tokens() %>%
  tokens_ngrams(n = 1:2) %>%
  head(2)
## Tokens consisting of 2 documents and 18 docvars.
## 61320_201211.1 :
##  [1] "Moving"              "America"             "Forward"            
##  [4] "2012"                "Democratic"          "National"           
##  [7] "Platform"            "Moving_America"      "America_Forward"    
## [10] "Forward_2012"        "2012_Democratic"     "Democratic_National"
## [ ... and 1 more ]
## 
## 61320_201211.2 :
## [1] "Moving"          "America"         "Forward"         "Moving_America" 
## [5] "America_Forward"

Tokenization is particularly important for pre-processing and cleaning the texts. One can easily remove numbers, punctuation, or stopwords. Moreover, it is simple to transform all text to lower case or to stem words.

quanteda_corpus %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("english")) %>%
  tokens_wordstem() %>%
  head(4)
## Tokens consisting of 4 documents and 18 docvars.
## 61320_201211.1 :
## [1] "move"     "america"  "forward"  "democrat" "nation"   "platform"
## 
## 61320_201211.2 :
## [1] "move"    "america" "forward"
## 
## 61320_201211.3 :
##  [1] "four"       "year"       "ago"        "democrat"   "independ"  
##  [6] "mani"       "republican" "came"       "togeth"     "american"  
## [11] "move"       "countri"   
## [ ... and 1 more ]
## 
## 61320_201211.4 :
## [1] "midst"    "greatest" "econom"   "crisi"    "sinc"     "great"    "depress"

Constructing a document-feature-matrix with dfm

The construction of a document-feature matrix is at the core of most automated text analysis workflows. dfm is quanteda’s powerful command to construct a document-feature matrix. Since quanteda 3.0, dfm is meant to be applied to a tokens object, so preprocessing happens in the tokens_* steps beforehand (in earlier versions, dfm could be applied directly to a corpus and passed most preprocessing arguments on to tokens). To get a “standard” preprocessed document-feature matrix with lower-cased and stemmed words and with punctuation and numbers removed, one would use the following pipeline:

quanteda_corpus %>%
  tokens(remove_punct = TRUE, remove_numbers = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("english")) %>%
  tokens_wordstem() %>%
  dfm()
## Document-feature matrix of: 3,188 documents, 3,941 features (99.75% sparse) and 18 docvars.
##                 features
## docs             move america forward democrat nation platform four year ago
##   61320_201211.1    1       1       1        1      1        1    0    0   0
##   61320_201211.2    1       1       1        0      0        0    0    0   0
##   61320_201211.3    1       0       1        1      0        0    1    1   1
##   61320_201211.4    0       0       0        0      0        0    0    0   0
##   61320_201211.5    0       0       0        0      1        0    0    0   0
##   61320_201211.6    0       0       0        0      0        0    0    0   0
##                 features
## docs             independ
##   61320_201211.1        0
##   61320_201211.2        0
##   61320_201211.3        1
##   61320_201211.4        0
##   61320_201211.5        0
##   61320_201211.6        0
## [ reached max_ndoc ... 3,182 more documents, reached max_nfeat ... 3,931 more features ]

You can modify a dfm by using various functions such as dfm_trim, dfm_select, dfm_weight, dfm_keep, dfm_lookup, dfm_sample, and many more. In the following example, we download the Irish manifestos from the 2016 election, do some standard preprocessing, and drop all quasi-sentences with headline codes (“H”), uncoded quasi-sentences (“0”, “000”), and quasi-sentences with missing codes (NA). We use dfm_group here to combine all quasi-sentences coded with the same code into one document. Standard cell entries in a dfm are counts of features per document; they can be transformed using the dfm_weight function. Here, we use it to calculate the proportion of words per document (scheme = "prop"). We then subset the dfm to four specific categories.

quanteda_irish <- mp_corpus(countryname == "Ireland" & date == 201602) %>%
  recode_v5_to_v4() %>%
  as.data.frame(with.meta = TRUE) %>%
  corpus(docid_field = "manifesto_id", unique_docnames = FALSE) %>%
  tokens(remove_punct = TRUE) %>%
  tokens_tolower() %>%
  tokens_remove(stopwords("english")) %>%
  dfm() %>%
  dfm_subset(!(cmp_code %in% c("H", "", "0", "000", NA))) %>%
  dfm_group(cmp_code) %>%
  dfm_weight(scheme = "prop") %>%
  dfm_subset(cmp_code %in% c("501", "502", "301", "411"))
## Connecting to Manifesto Project DB API... corpus version: 2017-2 
## Connecting to Manifesto Project DB API... corpus version: 2017-2
quanteda_irish
## Document-feature matrix of: 4 documents, 13,161 features (86.26% sparse) and 11 docvars.
##      features
## docs  think        ahead          act          now      general     election
##   301     0 0            0.0009632055 0.0003852822 0            0.0003852822
##   411     0 0.0003094059 0.0004641089 0.0004641089 0            0           
##   501     0 0            0.0011047980 0.0007891414 0            0.0001578283
##   502     0 0            0.0015507883 0.0002584647 0.0002584647 0           
##      features
## docs     manifesto         2016  progressive    practical
##   301 0.0001926411 0.0003852822 0            0           
##   411 0            0.0012376238 0.0001547030 0.0003094059
##   501 0            0.0006313131 0.0001578283 0.0001578283
##   502 0            0.0015507883 0.0002584647 0           
## [ reached max_nfeat ... 13,151 more features ]

To plot the most frequent terms, we use the textstat_frequency() function. It extracts the most frequent terms (here grouped by cmp_code) and returns these summary statistics as a data.frame. Such a data.frame serves as a perfect input for ggplot.

feature_frequencies_categories <- quanteda_irish %>% textstat_frequency(n = 10, groups = cmp_code)

feature_frequencies_categories %>%
  mutate(cmp_code = factor(group, labels = c("Decentralisation", "Technology & Infrastructure", "Environmental Protection", "Culture"))) %>%
  ggplot(aes(x = reorder(feature, frequency), y = frequency, fill = cmp_code)) +
  geom_col(show.legend = FALSE) +
  labs(x = NULL, y = "share of words per category") +
  facet_wrap(~cmp_code, ncol = 2, scales = "free") +
  coord_flip()

Similar to tidytext, quanteda also allows the calculation of term frequency-inverse document frequency (tf-idf) scores with dfm_tfidf.
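
A minimal sketch (note that tf-idf should be computed on raw counts, not on an already weighted dfm such as quanteda_irish above):

quanteda_corpus %>%
  tokens(remove_punct = TRUE) %>%
  dfm() %>%
  dfm_tfidf() ## reweights counts by term frequency-inverse document frequency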

Multi-word expressions

Multi-word expressions can pose a problem for automatic text analysis. The expression “New York” stands for something different from the two separate words “new” and “York”. Quanteda offers a simple way to identify such multi-word expressions based on collocations, using the textstat_collocations function. The following computes an association measure for word pairs. The list contains many expressions that may (or even should) be treated as one expression in automatic analyses, such as “United States” or “President Obama”.

quanteda_corpus %>%
  tokens() %>%
  tokens_remove(stopwords("english")) %>%
  textstat_collocations(method = "lambda", size = 2) %>%
  arrange(-lambda) %>%
  top_n(20)
## Selecting by z
##                 collocation count count_nested length   lambda        z
## 17          president obama   106            0      2 8.710226 15.82879
## 18             job creation    21            0      2 8.624032 15.40948
## 2          democratic party    51            0      2 7.881325 23.77678
## 1             united states    80            0      2 7.558881 24.56679
## 6            private sector    26            0      2 7.442870 19.31141
## 16          nuclear weapons    18            0      2 7.425171 16.07347
## 8          small businesses    20            0      2 7.313704 18.24845
## 19 current administration's    21            0      2 7.141155 15.40121
## 12             clean energy    22            0      2 7.093708 16.74232
## 11             around world    20            0      2 6.614726 16.97897
## 20          obama democrats    17            0      2 5.896752 15.37306
## 3    current administration    34            0      2 5.813144 22.34827
## 5               health care    30            0      2 5.620318 21.20490
## 14          economic growth    18            0      2 5.517368 16.23987
## 15         health insurance    16            0      2 5.478452 16.15043
## 4         national security    33            0      2 5.136260 21.30416
## 9          obama democratic    22            0      2 5.047492 17.81549
## 13         republican party    18            0      2 4.882577 16.53921
## 7        federal government    31            0      2 4.272125 19.00898
## 10          american people    29            0      2 3.831366 17.08902
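
To actually treat such word pairs as single tokens in subsequent analyses, one can pass the collocations to tokens_compound(), which joins the matching pairs with an underscore (e.g. “United_States”). A sketch, compounding the twenty top-ranked collocations on the same stopword-removed tokens they were estimated on (variable names are our own):

us2012_tokens <- quanteda_corpus %>%
  tokens() %>%
  tokens_remove(stopwords("english"))

top_collocations <- us2012_tokens %>%
  textstat_collocations(size = 2) %>%
  head(20) ## keep the twenty top-ranked word pairs

us2012_tokens %>%
  tokens_compound(pattern = top_collocations)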

Targeted sentiment analysis

Quanteda facilitates dictionary-based searches. The following example illustrates how to conduct a targeted sentiment analysis. We use the corpus created above based on the US party platforms of 2012 and tokenize it into words. We then keep only the token “President” together with the ten tokens before and after every occurrence of “President”.

pres_tokens <- tokens(quanteda_corpus) %>%
  tokens_select("President", selection = "keep", window = 10, padding = FALSE, verbose = TRUE)
## kept 2 features

Quanteda includes a sentiment dictionary constructed by Young and Soroka (2012), stored in data_dictionary_LSD2015. The dictionary contains thousands of positive and negative words and word stems.

data_dictionary_LSD2015[[1]] %>% head(10)
##  [1] "a lie"     "abandon*"  "abas*"     "abattoir*" "abdicat*"  "aberra*"  
##  [7] "abhor*"    "abject*"   "abnormal*" "abolish*"

We then use the sentiment dictionary to count positive and negative words among the words surrounding “President” in order to analyze which party speaks more positively or negatively about the president. We group by party to aggregate the frequencies of positive and negative words to the party level. The ratio of positive to negative words is much higher for the Democratic Party (61320) than for the Republican Party (61620) when speaking about the “President”. This is hardly surprising, as the incumbent president in 2012 was a Democrat.

pres_dfm <- dfm(pres_tokens) %>%
  dfm_lookup(data_dictionary_LSD2015[1:2]) %>%
  dfm_group(party)
pres_dfm
## Document-feature matrix of: 2 documents, 2 features (0.00% sparse) and 16 docvars.
##        features
## docs    negative positive
##   61320       64      181
##   61620       30       45
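
To make the comparison explicit, one can convert the small dfm into a data.frame and compute the ratio (a sketch using dplyr, which is already loaded):

pres_dfm %>%
  convert(to = "data.frame") %>%
  mutate(ratio = positive / negative) ## about 2.8 for 61320 vs 1.5 for 61620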

Quanteda (and its “sister” packages quanteda.textstats, quanteda.textplots, and quanteda.textmodels) has many more functions. In particular, the textstat_* functions of quanteda.textstats are powerful and can readily be applied to manifestos.
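
As one example, a sketch of a keyness analysis with textstat_keyness(), identifying the words that most distinguish the 2012 Democratic platform (the target document) from the Republican one:

quanteda_corpus %>%
  tokens(remove_punct = TRUE) %>%
  tokens_tolower() %>%
  dfm() %>%
  dfm_group(party) %>%             ## two documents: 61320 and 61620
  textstat_keyness(target = "61320") %>%
  head(10)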

References

Benoit, K., Watanabe, K., Wang, H., Nulty, P., Obeng, A., Müller, S., & Matsuo, A. (2018). “quanteda: An R package for the quantitative analysis of textual data.” Journal of Open Source Software, 3(30), 774. doi: 10.21105/joss.00774, https://quanteda.io.

Young, L., & Soroka, S. (2012). “Affective News: The Automated Coding of Sentiment in Political Texts.” Political Communication, 29(2), 205-231.

Session Info

Tested with:

## ─ Session info ───────────────────────────────────────────────────────────────
##  setting value                       
##  version R version 4.0.3 (2020-10-10)
##  date    2021-06-15                  
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package            * version    date       lib  source        
##  assertthat           0.2.0      2017-04-11 [NA] CRAN (R 4.0.3)
##  base64enc            0.1-3      2015-07-28 [NA] CRAN (R 4.0.2)
##  bookdown             0.22       2021-04-22 [NA] CRAN (R 4.0.2)
##  cli                  1.1.0      2019-03-19 [NA] CRAN (R 4.0.3)
##  colorspace           1.3-2      2016-12-14 [NA] CRAN (R 4.0.3)
##  crayon               1.3.4      2017-09-16 [NA] CRAN (R 4.0.2)
##  crosstalk            1.0.0      2016-12-21 [NA] CRAN (R 4.0.3)
##  curl                 3.2        2018-03-28 [NA] CRAN (R 4.0.3)
##  digest               0.6.21     2019-09-20 [NA] CRAN (R 4.0.3)
##  dplyr              * 1.0.6      2021-05-05 [NA] CRAN (R 4.0.2)
##  DT                   0.7        2019-06-11 [NA] CRAN (R 4.0.3)
##  ellipsis             0.3.2      2021-04-29 [NA] CRAN (R 4.0.3)
##  evaluate             0.14       2019-05-28 [NA] CRAN (R 4.0.1)
##  fansi                0.4.0      2018-10-05 [NA] CRAN (R 4.0.3)
##  farver               2.0.1      2019-11-13 [NA] CRAN (R 4.0.3)
##  fastmap              1.0.0      2019-07-28 [NA] CRAN (R 4.0.3)
##  fastmatch            1.1-0      2017-01-28 [NA] CRAN (R 4.0.2)
##  foreign              0.8-70     2018-04-23 [NA] CRAN (R 4.0.3)
##  functional           0.6        2014-07-16 [NA] CRAN (R 4.0.2)
##  generics             0.0.2      2018-11-29 [NA] CRAN (R 4.0.2)
##  ggplot2            * 3.3.3      2020-12-30 [NA] CRAN (R 4.0.2)
##  glue                 1.4.2      2020-08-27 [NA] CRAN (R 4.0.2)
##  gtable               0.2.0      2016-02-26 [NA] CRAN (R 4.0.3)
##  highr                0.6        2016-05-09 [NA] CRAN (R 4.0.3)
##  hms                  0.4.2      2018-03-10 [NA] CRAN (R 4.0.3)
##  htmltools            0.4.0      2019-10-04 [NA] CRAN (R 4.0.3)
##  htmlwidgets          1.5.3      2020-12-10 [NA] CRAN (R 4.0.2)
##  httpuv               1.5.2      2019-09-11 [NA] CRAN (R 4.0.3)
##  httr                 1.3.1      2017-08-20 [NA] CRAN (R 4.0.3)
##  ISOcodes             2018.06.29 2018-06-30 [NA] CRAN (R 4.0.3)
##  jsonlite             1.6        2018-12-07 [NA] CRAN (R 4.0.3)
##  knitr                1.33       2021-04-24 [NA] CRAN (R 4.0.2)
##  labeling             0.3        2014-08-23 [NA] CRAN (R 4.0.3)
##  later                1.0.0      2019-10-04 [NA] CRAN (R 4.0.3)
##  lattice              0.20-35    2017-03-25 [NA] CRAN (R 4.0.3)
##  lifecycle            1.0.0      2021-02-15 [NA] CRAN (R 4.0.2)
##  magrittr             2.0.1      2020-11-17 [NA] CRAN (R 4.0.2)
##  manifestoR         * 1.5.0      2020-11-29 [NA] CRAN (R 4.0.2)
##  Matrix               1.2-14     2018-04-09 [NA] CRAN (R 4.0.3)
##  mime                 0.5        2016-07-07 [NA] CRAN (R 4.0.3)
##  mnormt               1.5-5      2016-10-15 [NA] CRAN (R 4.0.3)
##  munsell              0.5.0      2018-06-12 [NA] CRAN (R 4.0.2)
##  nlme                 3.1-131    2017-02-06 [NA] CRAN (R 4.0.3)
##  NLP                * 0.1-9      2016-02-18 [NA] CRAN (R 4.0.3)
##  nsyllable            1.0        2020-11-30 [NA] CRAN (R 4.0.2)
##  pillar               1.6.1      2021-05-16 [NA] CRAN (R 4.0.2)
##  pkgconfig            2.0.2      2018-08-16 [NA] CRAN (R 4.0.3)
##  promises             1.1.0      2019-10-04 [NA] CRAN (R 4.0.3)
##  proxyC               0.2.0      2021-05-11 [NA] CRAN (R 4.0.2)
##  psych                1.8.3.3    2018-03-30 [NA] CRAN (R 4.0.3)
##  purrr                0.3.2      2019-03-15 [NA] CRAN (R 4.0.3)
##  quanteda           * 3.0.0      2021-04-06 [NA] CRAN (R 4.0.2)
##  quanteda.textstats * 0.94.1     2021-05-11 [NA] CRAN (R 4.0.2)
##  R6                   2.2.2      2017-06-17 [NA] CRAN (R 4.0.3)
##  Rcpp                 1.0.0      2018-11-07 [NA] CRAN (R 4.0.3)
##  RcppParallel         5.1.4      2021-05-04 [NA] CRAN (R 4.0.2)
##  readr                1.3.1      2018-12-21 [NA] CRAN (R 4.0.3)
##  rlang                0.4.10     2020-12-30 [NA] CRAN (R 4.0.2)
##  rmarkdown            2.8        2021-05-07 [NA] CRAN (R 4.0.2)
##  rmdformats           1.0.2      2021-04-19 [NA] CRAN (R 4.0.2)
##  scales               1.1.0      2019-11-18 [NA] CRAN (R 4.0.3)
##  sessioninfo          1.1.1      2018-11-05 [NA] CRAN (R 4.0.2)
##  shiny                1.4.0      2019-10-10 [NA] CRAN (R 4.0.3)
##  slam                 0.1-40     2016-12-01 [NA] CRAN (R 4.0.3)
##  SnowballC            0.5.1      2014-08-09 [NA] CRAN (R 4.0.3)
##  stopwords            0.9.0      2017-12-14 [NA] CRAN (R 4.0.3)
##  stringi              1.1.7      2018-03-12 [NA] CRAN (R 4.0.3)
##  stringr            * 1.3.0      2018-02-19 [NA] CRAN (R 4.0.3)
##  tibble               3.1.2      2021-05-16 [NA] CRAN (R 4.0.2)
##  tidyr              * 0.8.0      2018-01-29 [NA] CRAN (R 4.0.3)
##  tidyselect           1.1.1      2021-04-30 [NA] CRAN (R 4.0.3)
##  tm                 * 0.7-5      2018-07-29 [NA] CRAN (R 4.0.3)
##  utf8                 1.1.3      2018-01-03 [NA] CRAN (R 4.0.3)
##  vctrs                0.3.8      2021-04-29 [NA] CRAN (R 4.0.3)
##  withr                2.1.2      2018-03-15 [NA] CRAN (R 4.0.3)
##  xfun                 0.23       2021-05-15 [NA] CRAN (R 4.0.2)
##  xml2                 1.2.0      2018-01-24 [NA] CRAN (R 4.0.3)
##  xtable               1.8-2      2016-02-05 [NA] CRAN (R 4.0.3)
##  yaml                 2.2.0      2018-07-25 [NA] CRAN (R 4.0.3)
##  zoo                  1.7-13     2016-05-03 [NA] CRAN (R 4.0.3)