Introduction

Data

This data comes from Inside Airbnb, which is “an independent, non-commercial set of tools and data that allows you to explore how Airbnb is really being used in cities around the world” (Inside Airbnb). As with other exploratory text mining projects, one of the challenges is determining what information one can feasibly extract and quantify from a large text corpus. Techniques usually employed include term frequency and topic modeling, both of which involve a lot of exploratory trial and error. This project covers this initial stage of my larger project, and involves such trial and error.

In order to spend more time on the actual analysis of the text from Airbnb listings and reviews, I have chosen to work with data that has already been cleaned and set up by Inside Airbnb. It comes as two separate compressed CSV files, one for reviews and one for listings.
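As a minimal sketch of reading a compressed CSV, the example below writes a tiny stand-in file and reads it back with base R. The file contents and column names are made up for illustration; the real Inside Airbnb files have many more columns, and readr's read_csv() can be used in the same way.

```r
# Write a tiny stand-in for a compressed listings file (hypothetical columns)
tmp <- tempfile(fileext = ".csv.gz")
con <- gzfile(tmp, "w")
writeLines(c("id,summary",
             "1,Cozy room near subway",
             "2,Sunny loft in Brooklyn"), con)
close(con)

# read.csv() reads through a gzfile() connection, decompressing on the fly
listings_demo <- read.csv(gzfile(tmp), stringsAsFactors = FALSE)
nrow(listings_demo)
```

Reading the compressed file directly avoids keeping an uncompressed copy of a large listings or reviews file on disk.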

Setting Up and Bringing in Data

library(tm)
library(wordcloud)
library(qdap)
library(topicmodels)
library(dplyr)
library(Rgraphviz)
library(yaml)
library(readr)
library(magrittr)
library(wesanderson)
library(ggplot2)
library(cluster)
library(fpc)
listings <- read_csv("listings.csv")
Error: 'listings.csv' does not exist in current working directory ('C:/Users/racha/Desktop/Git/dv-final-project/R').

Description of the Data

Listings

The listings data includes 95 variables and 40,227 observations. Some relevant variables:

Variable                     Description
last_scraped                 date of day data was scraped from Airbnb website, for project reference
summary                      listing summary, text blob, will be useful for most text mining
space                        text blob of descriptions of space, particularly useful for my interest in the changing New York environment
description                  text blob, describing space and type of stay, relevant details
neighborhood_overview        text blob of Airbnb listing description of neighborhood, particularly interesting for my work on neighborhood and place perception
transit                      lister’s text blob description of transit access
host_id                      Host ID
host_neighborhood            Airbnb labeled neighborhoods
host_about                   description of host written by host
host_total_listings_count    Number of listings owned and operated by single host
neighborhood                 Listing location
city                         City
property_type                Type of listing (Apartment, House, etc.)
number_of_reviews            Number of total reviews
review_per_month             Number of reviews per month for listing
availability_365             Number of days available in coming year
# These arguments belong to plyr's join(); calling it with its namespace
# keeps another attached package's join() from masking it
combinedData <- plyr::join(reviews, listings_tm, by = 'id', type = 'left', match = 'all')
# Call wes_palette() through its namespace so it is found even if wesanderson is not attached
pal_3 <- wesanderson::wes_palette("Moonrise3", type = "continuous")
findAssocs(bk_tdm_2, "cozy", corlimit = 0.05)
$cozy
       place        clean         home neighborhood         room 
        0.07         0.06         0.06         0.05         0.05 
# Terms that occur at least 10,000 times in reviews
# and at least 1,000 times in listing summaries
d <- data.frame(freq_words = findFreqTerms(dtm_rev1, lowfreq = 10000))
m <- data.frame(freq_words = findFreqTerms(dtm_sum, lowfreq = 1000))

d_1 <- d
m_1 <- m

# Save both term lists for later reference
write.csv(m_1, "listings_mostfrequentwords.csv")
write.csv(d_1, "reviews_mostfrequentwords.csv")
findAssocs(dtm_sum, terms = "neighborhood", corlimit = 0.1)

# Store under a new name so the reviews data frame is not overwritten
rev_assocs <- findAssocs(dtm_rev1, terms = "neighborhood", corlimit = 0.1)


tdm_sum_bk

word <- c("safe", "quiet", "restaurants", "apartment", "felt", "subway", "around", "nice")
cor <- c(0.23, 0.15, 0.13, 0.12, 0.12, 0.11, 0.1, 0.1)


# data.frame() keeps cor numeric; cbind() would coerce both columns to character
rev_neigh <- data.frame(word, cor)
rev_neigh

write.csv(rev_neigh, "review_neighborhood_wordcorr.csv")


list_neigh <- c("safe", "love", "place", "adventurers", "restaurants", "solo", "ambiance", "couples", "good", "outdoors", "bars", 
                "business", "close", "travelers", "diverse")
list_cor <- c(0.21, 0.17, 0.15, 0.13, 0.13, 0.13, 0.12, 0.12, 0.12, 0.12, 0.11, 0.11, 0.11, 0.11, 0.1)

# data.frame() keeps list_cor numeric; cbind() would coerce both columns to character
list_neighb <- data.frame(list_neigh, list_cor)
write.csv(list_neighb, "listing_neighborhood_wordcorr.csv")
findAssocs(tdm_sum_si, terms = "cozy", corlimit = 0.1)
$cozy
              movies                banks               amazon               assure               bnbers             boroughs 
                0.36                 0.33                 0.32                 0.32                 0.32                 0.32 
             chinese                 cool               donuts               dunkin               famous                fancy 
                0.32                 0.32                 0.32                 0.32                 0.32                 0.32 
                hulu              italian            landlords               locale             longtime               loving 
                0.32                 0.32                 0.32                 0.32                 0.32                 0.32 
           mcdonalds            privately             railroad             roomates         supermarkets            unlimited 
                0.32                 0.32                 0.32                 0.32                 0.32                 0.32 
                fast               access           accessible                  cvs          environment            exquisite 
                0.31                 0.30                 0.28                 0.28                 0.28                 0.28 
                food                 iron              netflix                 trip              venture                 feel 
                0.28                 0.28                 0.28                 0.28                 0.28                 0.26 
                stay                  art              culture              express       familyfriendly             business 
                0.25                 0.24                 0.24                 0.24                 0.24                 0.23 
              couple                 hour                  lot             roadways              schools           activities 
                0.23                 0.23                 0.23                 0.23                 0.23                 0.22 
          everything                 five                happy                  get             internet              minutes 
                0.22                 0.22                 0.22                 0.21                 0.20                 0.20 
              plenty            amenities               entire             friendly                 make                pizza 
                0.20                 0.19                 0.19                 0.19                 0.19                 0.19 
      transportation               nearby                 need                 take                 wifi                 goes 
                0.19                 0.18                 0.18                 0.18                 0.18                 0.17 
       international                 like                major            manhattan              premium                 solo 
                0.17                 0.17                 0.17                 0.17                 0.17                 0.17 
                town                 warm              говорим               русски         accommodates           adventurer 
                0.17                 0.17                 0.16                 0.16                 0.16                 0.16 
           alkalized                andor                  atm         availability              balcony                  bar 
                0.16                 0.16                 0.16                 0.16                 0.16                 0.16 
           beachside               belong             bookings            brickhome             bungalow               buscar 
                0.16                 0.16                 0.16                 0.16                 0.16                 0.16 
          businesses                 buss                 calm               carpet              chaplin              charlie 
                0.16                 0.16                 0.16                 0.16                 0.16                 0.16 
              church               cityit              clients                 cold         conditioning             connects 
                0.16                 0.16                 0.16                 0.16                 0.16                 0.16 
                deal             drinking           economical             enclosed        entertainment                 fame 
                0.16                 0.16                 0.16                 0.16                 0.16                 0.16 
            filtered                 fios             flooring                 girl               grassy                 hair 
                0.16                 0.16                 0.16                 0.16                 0.16                 0.16 
             harbour              highway            hollywood                homey          individuals                  jfk 
                0.16                 0.16                 0.16                 0.16                 0.16                 0.16 
           laguardia              leather            loftstyle            luxurious                mabel                mable 
                0.16                 0.16                 0.16                 0.16                 0.16                 0.16 
               males              message                month             nautical              nearest               normal 
                0.16                 0.16                 0.16                 0.16                 0.16                 0.16 
             normand               office             pastries              persian             pictures               played 
                0.16                 0.16                 0.16                 0.16                 0.16                 0.16 
           poolcheck             positive              present               privat             produced            questions 
                0.16                 0.16                 0.16                 0.16                 0.16                 0.16 
          references           requesting            reservoir restaurantscafesbars               retail      roomkitchenette 
                0.16                 0.16                 0.16                 0.16                 0.16                 0.16 
             russian              seeking             services                sofas              spanish             sunporch 
                0.16                 0.16                 0.16                 0.16                 0.16                 0.16 
              target             theaters          traditional           trainferry               tuscan                 used 
                0.16                 0.16                 0.16                 0.16                 0.16                 0.16 
                ways            wideplank          worldfamous                youre                  bus             spacious 
                0.16                 0.16                 0.16                 0.16                 0.15                 0.15 
               close                comfy                quiet          restaurants            transport            travelers 
                0.14                 0.14                 0.14                 0.14                 0.14                 0.14 
            external              friends                  nyc                relax               yankee          adventurers 
                0.13                 0.13                 0.13                 0.13                 0.13                 0.12 
              across               dining             includes            including                lined                  new 
                0.11                 0.11                 0.11                 0.11                 0.11                 0.11 
             stadium               staten                water                 safe 
                0.11                 0.11                 0.11                 0.10 

Initial Plot of Word type

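To begin, I run a simple ggplot histogram of word lengths (the number of letters in each term), which helps to verify that the text was cleaned and tokenized properly: a long tail of very long “words” usually means that whitespace or punctuation was lost during cleaning.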
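The same check can be sketched in base R without a plot. The vocabulary below is a made-up stand-in for colnames(dtm_sum), and the 15-letter cutoff is an arbitrary assumption chosen for illustration.

```r
# Hypothetical vocabulary standing in for colnames(dtm_sum)
words <- c("cozy", "safe", "subway", "neighborhood",
           "restaurantscafesbars", "nyc")

# Count how many words have each length
nletters <- nchar(words)
length_counts <- table(nletters)
length_counts

# Flag suspiciously long "words" (the cutoff of 15 is an arbitrary choice)
long_words <- words[nletters > 15]
long_words
```

Terms flagged this way are good candidates for a closer look at the cleaning steps, since they often come from words that were glued together when punctuation was removed.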



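# Note: list_manhattancor has 13 terms but list_manhattancor1 has 14 values,
# so the two vectors cannot be paired until the lengths are reconciled
list_manhattancor <- c("minutes", "ride", "brooklyn", "skyline", "train", "midtown", "subway", "williamsburg", "downtown", "min",
                       "stop", "away", "mins")

list_manhattancor1 <- c(0.22, 0.16, 0.15, 0.15, 0.15, 0.14, 0.14, 0.14, 0.14, 0.13, 0.13, 0.13, 0.1, 0.1)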


dist_tab(nchar(colnames(dtm_sum)))

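Here’s one way to find the top N terms in a term-document matrix. Briefly, you sum each term’s counts across all documents, then sort the totals. In a term-document matrix the terms are rows, so these are row sums; in a document-term matrix such as dtm_sum the terms are columns, so they are column sums:

# load text mining library
library(tm)

# example corpus that ships with tm, for reproducibility
data("crude")
corpus <- crude

# process text (your methods may differ)
skipWords <- function(x) removeWords(x, stopwords("english"))
# tolower is not a tm transformation, so wrap it in content_transformer()
funcs <- list(content_transformer(tolower), removePunctuation, removeNumbers, stripWhitespace, skipWords)
a <- tm_map(corpus, FUN = tm_reduce, tmFuns = funcs)
a.dtm1 <- TermDocumentMatrix(a, control = list(wordLengths = c(3,10)))


N <- 10
findFreqTerms(dtm_sum, 10000)

# dtm_sum is a document-term matrix (terms are columns), so sum over columns;
# slam::col_sums() works on the sparse matrix without building a dense copy
v <- sort(slam::col_sums(dtm_sum), decreasing = TRUE)
head(v, N)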
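A minimal sketch of the same idea on a toy document-term matrix, using only base R. The counts and term names are made up for illustration.

```r
# Toy document-term matrix: rows are documents, columns are terms (made-up counts)
toy_dtm <- matrix(
  c(2, 0, 1,
    1, 3, 0,
    4, 1, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("cozy", "subway", "safe"))
)

# Terms are columns here, so the total frequency of each term is a column sum
term_totals <- sort(colSums(toy_dtm), decreasing = TRUE)

# Keep the two most frequent terms
top_terms <- head(term_totals, 2)
top_terms
```

Using rowSums() on this matrix would instead return the number of words in each document, which is a common source of confusion when switching between the two matrix orientations.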

data.frame(nletters = nchar(colnames(dtm_sum))) %>%
ggplot(aes(x = nletters)) + geom_histogram(binwidth = 1) +
geom_vline(xintercept = mean(nchar(colnames(dtm_sum))), 
           color = "blue", size = 1, alpha = .5) +
labs(x = "Number of Letters", y = "Number of Words") + xlim(c(0, 30))

# Calculate word frequency 

# install.packages("slam")  # run once if slam is not already installed
library(slam)
# dtm_sum <- rollup(dtm_sum, 2, na.rm=TRUE, FUN = sum)


# Brooklyn
dtm_sum_1

slam::row_sums(dtm_sum, na.rm = T)
slam::col_sums(tdm, na.rm = T)


freq <- colSums(as.matrix(dtm_sum_mn)) 
len_dtm <- freq %>% length()

# Assign the result (the original used `<`, which only compares and discards it)
dtm_sum <- removeSparseTerms(dtm_sum, 0.9999)
# Word matrix length (number of words included in the word frequency)
len_dtm
wfreq <- as.data.frame(freq)
wfreq <- setNames(cbind(rownames(wfreq), wfreq, row.names = NULL), 
         c("word", "freq"))

wfreq_manhattan <- wfreq %>% filter(freq > 1)
write.csv(wfreq_manhattan, "word_frequency_manhattan.csv")

#word brooklyn
freqbk <- colSums(as.matrix(dtm_sum_bk)) 
len_dtm <- freqbk %>% length()

# Word matrix length (no words included in word frequency) 
len_dtm
wfreqbk <- as.data.frame(freqbk)
wfreqbk <- setNames(cbind(rownames(wfreqbk), wfreqbk, row.names = NULL), 
         c("word", "freq"))

wfreq_bk <- wfreqbk %>% filter(freq > 1)
write.csv(wfreq_bk, "word_frequency_brooklyn.csv")



# Plot word frequency for every word in wfreq (no frequency threshold is applied here)
plot_bk <- ggplot(wfreq, aes(word, freq)) +
  geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1))+ 
  ggtitle("Word Count")

plot_bk


# Calculate word frequency 
# Manhattan 
freq_mn <- colSums(as.matrix(dtm_sum_mn)) 
len_dtm_mn <- freq_mn %>% length()
# Word matrix length (no words included in word frequency) 
len_dtm_mn

wfreq_mn <- as.data.frame(freq_mn) %>% subset(freq_mn > 5000)
wfreq_mn <- setNames(cbind(rownames(wfreq_mn), wfreq_mn, row.names = NULL), 
         c("word", "freq"))


# Plot word frequency for those Manhattan words that occur more than 5000 times
plot_mn <- ggplot(wfreq_mn, aes(word, freq)) +
  geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1))+ 
  ggtitle("Word Count")

plot_mn





wfreq_bk <- as.data.frame(freqbk) %>% subset(freqbk > 5000)
wfreq_bk <- setNames(cbind(rownames(wfreq_bk), wfreq_bk, row.names = NULL), 
         c("word", "freq"))


# Plot word frequency for those Brooklyn words that occur more than 5000 times
plot_bk <- ggplot(wfreq_bk, aes(word, freq)) +
  geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1))+ 
  ggtitle("Word Count")

plot_bk

These two plots show slight differences in the wording used to describe Manhattan apartments compared to Brooklyn apartments. I investigate these differences further using term correlations, and then focus more specifically on the Brooklyn data for clustering and topic modeling.

findAssocs(mn_tdm_2, c("diverse"), corlimit = 0.01)   
$diverse
numeric(0)

It’s interesting to note that the word correlations differ between the two boroughs. For example, ‘access’ in Brooklyn is associated with words like ‘backyard’ and ‘manhattan’, whereas ‘access’ in Manhattan is not associated with backyards. This makes sense, since a listing is more likely to have a backyard in Brooklyn than in Manhattan.

Since in my eventual project I am interested in neighborhood and perception of neighborhood, I have chosen it as one of the words that I focus on when looking for the other terms associated with it.

For both subsets, ‘safe’ is the term most highly correlated with ‘neighborhood’. Since I’m currently looking at the listings data, this is not surprising: Airbnb listers are marketing their home or listing, and so tend to advertise it and its surrounding area as ‘safe’, which makes the two terms appear together in the same descriptions.

It is also interesting for me in my work to note that ‘diverse’ appears in association with neighborhood in Brooklyn, while not in Manhattan. This may be an interesting layer of information to further delve into, perhaps as it relates to Airbnb prices, reviews, and interest in Brooklyn, compared to the rest of the city, and how that compares to other studies on demographics and gentrification, for example.
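To make the term associations above more concrete, here is a minimal base R sketch of the underlying idea: correlate the per-document counts of a target term with the counts of every other term, then keep the terms above a lower limit. The toy matrix is made up, and this is an illustration of the technique under that assumption rather than the exact internals of findAssocs().

```r
# Toy document-term matrix: one row per listing, made-up counts
toy_dtm <- matrix(
  c(1, 1, 0,
    2, 1, 0,
    0, 0, 1,
    0, 1, 2,
    3, 2, 0),
  nrow = 5, byrow = TRUE,
  dimnames = list(NULL, c("neighborhood", "safe", "subway"))
)

target <- "neighborhood"
others <- setdiff(colnames(toy_dtm), target)

# Pearson correlation between the target term's counts and each other term's counts
assoc <- cor(toy_dtm[, others], toy_dtm[, target])[, 1]

# Keep only terms at or above the correlation limit, strongest first
assoc <- sort(assoc[assoc >= 0.1], decreasing = TRUE)
round(assoc, 2)
```

A high value means the two terms tend to be frequent in the same listings, not that they appear next to each other in a sentence.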

Some Basic Clustering using K-means


# Brooklyn only 
# Matrix Term Frequency 
dtm_sum_tf <- weightTfIdf(dtm_sum)
dtm_sum_m  <- as.matrix(dtm_sum_tf)

# Some K-means clustering initially 
set.seed(1)

# Clusters set to 4
k <- 4

# k-means clustering 
km_out <- kmeans(dtm_sum_m, centers  = k)
colnames(km_out$centers) <- colnames(dtm_sum_m)
str(km_out)

# Listings in the fourth cluster: what words are used most commonly for example
# Brooklyn

names(head(sort(km_out$centers[4,], decreasing = TRUE), 16))
as.integer(gsub("fp", "",
           gsub(".txt", "", names(which(km_out$cluster == 4)), fixed = TRUE)))
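The clustering steps above can be sketched on a toy example in base R. The counts are made up, and the tf-idf weighting is written out by hand (term frequency normalized by document length, times log2 of the inverse document frequency) as an assumption about what a normalized tf-idf weighting computes.

```r
# Toy counts: two transit-focused listings and two outdoor-space listings (made up)
counts <- matrix(
  c(3, 2, 0, 0,
    2, 3, 0, 0,
    0, 0, 3, 2,
    0, 0, 2, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("listing", 1:4),
                  c("subway", "train", "garden", "backyard"))
)

# tf: share of each term within a listing; idf: log2(listings / listings containing the term)
tf    <- counts / rowSums(counts)
idf   <- log2(nrow(counts) / colSums(counts > 0))
tfidf <- sweep(tf, 2, idf, `*`)

# k-means with several random starts to reduce the chance of a poor local optimum
set.seed(1)
km_demo <- kmeans(tfidf, centers = 2, nstart = 10)
km_demo$cluster
```

On real listing text the matrix is far larger and sparser, which is one reason the clusters are harder to interpret than in this toy case.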

Topic Modeling

After having done some basic clustering and seen its limitations on this complex text data, which needs to be cleaned further before it yields meaningful clusters at a higher level, I move on to topic modeling, which allows me to extract and consider what kinds of topics are most common in listing descriptions.

This information is useful for further research because the probability that a listing covers a given topic can be associated with additional data, like demographics, changing urban planning policy, or real-estate markets.

This initial project involves determining these topics. I choose k = 5 and run a latent Dirichlet allocation (LDA) model.

library(topicmodels)

# First set the basic parameters for topic modeling
seed <- list(2003, 5, 63, 100001, 765)  # one seed per random start (nstart = 5)
k <- 5

# LDA fitted with Gibbs sampling
# Brooklyn

rowTotals <- slam::row_sums(dtm_sum)    # Find the sum of words in each document
dtm_new   <- dtm_sum[rowTotals > 0, ]   # remove all docs without words

lda_out_bk <- LDA(dtm_new, k, method = "Gibbs",
             control = list(nstart = 5, seed = seed,
                          best = TRUE, burnin = 4000,
                          iter = 2000, thin = 500))

# Save and print terms to observe grouping and topics generally
t <- terms(lda_out_bk, 5)
t
lda_out_topics <- as.matrix(topics(lda_out_bk))

# extract probabilities that a listing will fall within certain topic based on k = 5 topics
lda_out_terms <- as.matrix(terms(lda_out_bk, 6))
topicp_probabilities_bk <- as.data.frame(lda_out_bk@gamma)

# Summary of probability that a certain topic occurs for a given listing
summary(topicp_probabilities_bk)

# Rename data frame column
topicp_probabilities_bk <- setNames(cbind(rownames(topicp_probabilities_bk), topicp_probabilities_bk, row.names = NULL), c("id", "Brooklyn", "Space", "Lodging", "Environment", "Location"))


ggplot(data = topicp_probabilities_bk, aes(Brooklyn, "Topics")) + 
     geom_point(aes(x = Space, y = Brooklyn), colour = alpha('red', 0.05)) + 
     geom_point(aes(x = Lodging, y = Brooklyn), colour = alpha('blue', 0.05)) +  
     geom_point(aes(x = Environment, y = Brooklyn), colour = alpha('green', 0.05)) + 
     geom_point(aes(x = Location, y = Brooklyn), colour = alpha('purple', 0.05))


# Use the raw topic-probability matrix here: topicp_probabilities_bk also
# carries an id column, which would be mixed into the sort
gamma_bk <- lda_out_bk@gamma

# Calculate relative importance of the most likely topic to the second most likely topic
topic1_topic2 <- lapply(1:nrow(gamma_bk), function(x)
sort(gamma_bk[x, ])[k] / sort(gamma_bk[x, ])[k - 1])

# Calculate relative importance of the second most likely topic to the third
topic2_topic3 <- lapply(1:nrow(gamma_bk), function(x)
sort(gamma_bk[x, ])[k - 1] / sort(gamma_bk[x, ])[k - 2])

# Probabilities that listing will fall in one of the topics
mean(round(posterior(lda_out_bk, dtm_new)$topics, digits = 3))
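The topic-ratio calculation above can be sketched on a made-up probability matrix in base R. The matrix gamma_demo is a hypothetical stand-in for the per-listing topic probabilities of a fitted model, with one row per listing and one column per topic.

```r
# Made-up topic probabilities: 2 listings x 5 topics, each row sums to 1
gamma_demo <- matrix(
  c(0.60, 0.20, 0.10, 0.05, 0.05,
    0.30, 0.30, 0.20, 0.10, 0.10),
  nrow = 2, byrow = TRUE
)

# Ratio of the most likely topic to the second most likely topic, per listing
top_two_ratio <- apply(gamma_demo, 1, function(p) {
  ranked <- sort(p, decreasing = TRUE)
  ranked[1] / ranked[2]
})

# Index of the most likely topic for each listing (ties go to the first topic)
dominant_topic <- apply(gamma_demo, 1, which.max)
top_two_ratio
dominant_topic
```

A ratio well above 1 suggests that a listing is dominated by a single topic, while a ratio near 1 suggests that it mixes at least two topics about equally.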

Conclusions and Further Research

From this project, we have developed an understanding of the structure and topics that generally appear in Airbnb listings for New York City, and for Brooklyn specifically. This is a starting point for further analysis of the ways Airbnb data can be used to look into the changing urban environment, and the ways neighborhoods and cities are marketed.

While Airbnb data has largely been used to predict pricing and to analyze how pricing is determined by listings, reviews, neighborhood, and other such factors, we are specifically interested not in understanding or predicting pricing or Airbnb use, but in the ways one can study the changing landscape of cities and how city residents and visitors perceive them.

This initial project has given me the foundations to dive further into this topic. Topic modeling has demonstrated that place is a significant part of Airbnb listings: the neighborhood, place, and relationships between spaces are significant topics that appear and can be used to sort listing data.

Further research in this direction will focus on spatially understanding how these topics present themselves, using the location-based data available in the given dataset.

Matching review data to listing data would be another useful direction, since it would make it possible to look at both sides of how a neighborhood or place is represented.

