Twitter Data Mining, tutorial 1

This tutorial will cover the following Twitter data mining techniques:

1. Data collection

  1. Connect to Streaming API and collect live tweets

2. Exploratory data analysis (EDA)

  1. Plot histogram of followers count
  2. Plot tweet volume time series
  3. Plot word cloud of tweet corpus
  4. Plot tweets with geo-coordinates on an interactive map

3. Social network analysis (Gephi)

  1. Plot tweet-retweet social network graph
  2. Community detection in Gephi
  3. NLP: topic classification (NMF) for different tweet communities

Prerequisite

You must have cloned and installed Twitter Bot Monitor on your computer. Please see the GitHub repo for installation instructions.

Contact

Zhouhan Chen zc12@rice.edu

Import core libraries

In [1]:
import json
import util
import streamer
import datetime
import numpy as np
import pandas as pd
import detect
from twitter_credential import token_dict
from collections import defaultdict

Collect live tweets containing the keyword "trump". The detector will create a folder named "trump_test2" if it does not already exist.

In [2]:
prefix = 'trump_test2'
keyword = ['trump']
num_tweets = 20000
duration = 3600
auth_key = 'streaming_1'

src_path = util.get_full_src_path(prefix)
print("The absolute path of raw data file is")
print(src_path)
print()

full_prefix = util.get_full_prefix(prefix)
print("The prefix for all subsequent files is")
print(full_prefix)
print()

tweetStreamer = streamer.Streamer(auth_key=token_dict[auth_key])
tweetStreamer.collect(keyword=keyword, filename=src_path,
                      num_tweets=num_tweets, duration=duration, whitelist=[],
                      save_file=True, print_info="info")


detector = detect.SpamDetector(prefix=full_prefix, url_based = False,
                               sourcefile=src_path)

# generate user info dictionary 
detector.save_user_info()
The absolute path of raw data file is
/Users/zc/Documents/twitter_data/stream/trump_test2.txt

The prefix for all subsequent files is
/Users/zc/Documents/twitter_data/processed_data/trump_test2/trump_test2_tweet_

******************************************
* File already exists!
*
* You must change your prefix or delete the
* original file
******************************************
start init...
/Users/zc/Documents/twitter_data/processed_data/trump_test2
directory already exists
finish init...
file already exists......
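
The streamer writes the raw file as newline-delimited JSON, one tweet object per line. If util.loadjson is not at hand, the file can be read back with only the standard library. A minimal sketch, assuming the NDJSON layout and using an in-memory stand-in for the file at src_path:

```python
import io
import json

# Simulated raw stream file: one JSON-encoded tweet per line (NDJSON),
# standing in for the file written to src_path.
raw = io.StringIO(
    '{"text": "hello", "user": {"followers_count": 10}}\n'
    '{"text": "world", "user": {"followers_count": 25}}\n'
)

# Parse each non-empty line into a tweet dictionary.
tweets = [json.loads(line) for line in raw if line.strip()]
print(len(tweets), tweets[0]["user"]["followers_count"])  # 2 10
```

With a real file, replace the StringIO object with open(src_path) inside a with block.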

How many followers do these users have? Let's make a histogram.

In [3]:
# EDA: plot the distribution of followers count
import matplotlib.pyplot as plt
plt.style.use('classic')
%matplotlib inline

import seaborn as sns
sns.set()

followers_count = []
for tweet in util.loadjson(src_path):    
    followers_count.append(tweet['user']['followers_count'])


print("followers count mean is ", np.mean(followers_count))
print("followers count std is ", np.std(followers_count))

# clip at 10,000 followers so the long tail does not dominate the plot
followers_count = [num for num in followers_count if num < 10000]
plt.hist(followers_count, alpha=0.5, bins=20)
plt.xlabel('Followers count')
plt.ylabel('Number of accounts')  
plt.title('Histogram of followers count')
read 5000 tweets...
read 10000 tweets...
read 15000 tweets...
Finish loading 19821 tweets
followers count mean is  6786.20326926
followers count std is  126908.923233
Out[3]:
<matplotlib.text.Text at 0x10e783518>

Which words are tweeted most often? Let's make a word cloud.

In [4]:
from utility.wordcloud_maker import generate_cloud
from IPython.display import Image
from IPython.core.display import HTML 

text = []
for tweet in util.loadjson(src_path):    
    text.append(tweet['text'])

generate_cloud(' '.join(text))
# to save and display the image instead:
# generate_cloud(' '.join(text), full_prefix + 'wordcloud')
# Image(filename=full_prefix + 'wordcloud.png')
read 5000 tweets...
read 10000 tweets...
read 15000 tweets...
Finish loading 19821 tweets
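
If the wordcloud helper is unavailable, the same corpus can be summarized as a plain term-frequency table with only the standard library. A minimal sketch; the three tweet texts are synthetic stand-ins for the collected corpus, and the regex tokenizer is a simplification (it splits @mentions and URLs into fragments):

```python
import re
from collections import Counter

# Synthetic stand-ins for the collected tweet texts.
text = [
    "Trump rally tonight",
    "RT @user: Trump speech",
    "trump news update",
]

# Lowercase, tokenize on word characters, and count term frequencies.
tokens = re.findall(r"\w+", " ".join(text).lower())
counts = Counter(tokens)
print(counts.most_common(5))
```

The most frequent terms from this count are exactly what the word cloud sizes its words by.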