Topic Modeling (Latent Dirichlet Modeling) using Project Gutenberg

For this topic modeling analysis, three works were analyzed: Pride and Prejudice by Jane Austen, The Fall of the House of Usher by Edgar Allan Poe and The War of the Worlds by H.G. Wells. Since each work has a distinct genre and explores specific themes, these three works were apt.

Pride and Prejudice charts character development of the Bennet daughters, especially Elizabeth Bennet, exploring themes like love, marriage, social structures, wealth and class. The Fall of the House of Usher sketches a Gothic, haunted house, exploring themes like isolation, fear and the supernatural. The War of the Worlds is a science fiction tale of humans' interaction and conflicts with extraterrestrial life, exploring themes like warfare, conflict, technology, fear and otherness.

Import packages

library(gutenbergr)
library(tidytext)
library(topicmodels)
library(dplyr)
library(ggplot2)

Retrieve text from Project Gutenberg

gutenberg_works(title == "Pride and Prejudice")

## # A tibble: 1 x 8
##   gutenberg_id title   author gutenberg_autho… language gutenberg_bookshe…
##          <int> <chr>   <chr>             <int> <chr>    <chr>             
## 1         1342 Pride … Auste…               68 en       Best Books Ever L…
## # ... with 2 more variables: rights <chr>, has_text <lgl>

gutenberg_works(title == "The Fall of the House of Usher")

## # A tibble: 1 x 8
##   gutenberg_id title    author  gutenberg_autho… language gutenberg_books…
##          <int> <chr>    <chr>              <int> <chr>    <chr>           
## 1          932 The Fal… Poe, E…              481 en       Movie Books     
## # ... with 2 more variables: rights <chr>, has_text <lgl>

gutenberg_works(title == "The War of the Worlds")

## # A tibble: 1 x 8
##   gutenberg_id title   author   gutenberg_autho… language gutenberg_books…
##          <int> <chr>   <chr>               <int> <chr>    <chr>           
## 1           36 The Wa… Wells, …               30 en       Movie Books/Sci…
## # ... with 2 more variables: rights <chr>, has_text <lgl>

literature <- gutenberg_download(c(1342, 932, 36), meta_fields = "title") # passing gutenberg_id
head(literature, 15)

## # A tibble: 15 x 3
##    gutenberg_id text                                       title          
##           <int> <chr>                                      <chr>          
##  1           36 The War of the Worlds                      The War of the…
##  2           36 ""                                         The War of the…
##  3           36 by H. G. Wells [1898]                      The War of the…
##  4           36 ""                                         The War of the…
##  5           36 ""                                         The War of the…
##  6           36 "     But who shall dwell in these worlds… The War of the…
##  7           36 "     inhabited? .  .  .  Are we or they … The War of the…
##  8           36 "     World? .  .  .  And how are all thi… The War of the…
##  9           36 "          KEPLER (quoted in The Anatomy … The War of the…
## 10           36 ""                                         The War of the…
## 11           36 ""                                         The War of the…
## 12           36 ""                                         The War of the…
## 13           36 BOOK ONE                                   The War of the…
## 14           36 ""                                         The War of the…
## 15           36 THE COMING OF THE MARTIANS                 The War of the…

Tokenize text

tokenize <- literature %>% unnest_tokens(word, text)
head(tokenize)

## # A tibble: 6 x 3
##   gutenberg_id title                 word  
##          <int> <chr>                 <chr>
## 1           36 The War of the Worlds the   
## 2           36 The War of the Worlds war   
## 3           36 The War of the Worlds of    
## 4           36 The War of the Worlds the   
## 5           36 The War of the Worlds worlds
## 6           36 The War of the Worlds by

Remove stop words

tokenize_stop <- tokenize %>%
  anti_join(stop_words, by = "word") %>%
  dplyr::count(title, word, sort = TRUE)

Cast text to Document Term Matrix (DTM) and applying Latent Dirichlet Allocation Model (LDA)

dtm <- tokenize_stop %>% cast_dtm(title, word, n) # casting tokenized text in Document Term Matrix (DTM)
LDM_model <- LDA(dtm, k = 3, control = list(seed = 1234)) # selecting k = 3 for three texts
result <- tidy(LDM_model, matrix = "beta")

Retrieve top 20 words from each topic

top_20_words <- result %>%
  group_by(topic) %>%
  top_n(20, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)  %>%
  mutate(term = reorder(term, beta))

Plot top 20 words

top_20_words$topic <- factor(top_20_words$topic)
levels(top_20_words$topic)

## [1] "1" "2" "3"

top_20_words$topic <- plyr::revalue(top_20_words$topic, replace = c("1" = "1 ~ The War of the Worlds",
                                              "2" = "2 ~ Fall of the House of Usher",
                                              "3" = "3 ~ Pride and Prejudice"))

ggplot(top_20_words, aes(term, beta, fill = topic)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") + theme_bw() +
  coord_flip() + scale_fill_brewer(palette = "Set1") + theme(text = element_text(size = 18))

For this topic modeling analysis, three works were analyzed: Pride and Prejudice by Jane Austen, The Fall of the House of Usher by Edgar Allan Poe and The War of the Worlds by H.G. Wells. Since each work has a distinct genre and explores specific themes, these three works were apt.

Import packages

Retrieve text from Project Gutenberg

Tokenize text

Remove stop words

Cast text to Document Term Matrix (DTM) and applying Latent Dirichlet Allocation Model (LDA)

Retrieve top 20 words from each topic

Plot top 20 words

Part 6 of the Text Analysis series

Previous articles

Next articles