Twitter Data Mining, tutorial 1

This tutorial covers the following Twitter data mining techniques:

1. Data collection

Connect to the Streaming API and collect live tweets

2. Exploratory data analysis (EDA)

Plot histogram of followers count
Plot tweet volume time series
Plot word cloud of tweet corpus
Plot tweets with geo-coordinates on an interactive map

Plot tweet-retweet social network graph
Community detection in Gephi
NLP: topic classification (NMF) for different tweet communities

Prerequisite

You must have cloned and installed Twitter Bot Monitor on your computer. See the GitHub repo for installation instructions. The repo is private, so please email me your GitHub handle so that I can grant you access.

Contact

Zhouhan Chen zc1245@nyu.edu

Import core libraries

import json
import util
import streamer
import datetime
import numpy as np
import pandas as pd
import detect
from twitter_credential import token_dict
from collections import defaultdict

Collect live tweets containing the keyword "trump". The detector will create a folder named "trump_test2" if the folder does not already exist.

prefix = 'trump_test2'
keyword = ['trump']
num_tweets = 20000
duration = 3600
auth_key = 'streaming_1'

src_path = util.get_full_src_path(prefix)
print("The absolute path of raw data file is")
print(src_path)
print()

full_prefix = util.get_full_prefix(prefix)
print("The prefix for all subsequent files is")
print(full_prefix)
print()

tweetStreamer = streamer.Streamer(auth_key=token_dict[auth_key])
tweetStreamer.collect(keyword=keyword, filename=src_path,
                      num_tweets=num_tweets, duration=duration, whitelist=[],
                      save_file=True, print_info="info")


detector = detect.SpamDetector(prefix=full_prefix, url_based=False,
                               sourcefile=src_path)

# generate user info dictionary 
detector.save_user_info()

The absolute path of raw data file is
/Users/zc/Documents/twitter_data/stream/trump_test2.txt

The prefix for all subsequent files is
/Users/zc/Documents/twitter_data/processed_data/trump_test2/trump_test2_tweet_

******************************************
                     * File already exists!
                     *
                     * You must change your prefix or delete the 
                     * original file
                     ******************************************
                  
start init...
/Users/zc/Documents/twitter_data/processed_data/trump_test2
directory already exists
finish init...
file already exists......
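
The cells that follow iterate over `util.loadjson(src_path)` several times. That helper lives in the private repo, so its implementation is not shown here. As a rough sketch only, assuming the raw file stores one JSON-encoded tweet per line (an assumption, not something this tutorial confirms), a stand-in generator could look like the following. The name `iter_tweets` is hypothetical.

```python
import json
import os
import tempfile


def iter_tweets(path):
    """Yield one tweet dict per non-empty line of a line-delimited JSON file.

    Hypothetical stand-in for util.loadjson; the real helper may differ.
    """
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # blank lines carry no tweet
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # skip a truncated or corrupted record


# Tiny demonstration on a temporary file that contains one malformed line
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write('{"text": "first", "user": {"followers_count": 10}}\n')
    tmp.write("\n")
    tmp.write('{"text": "broken"\n')
    tmp.write('{"text": "second", "user": {"followers_count": 25}}\n')
    demo_path = tmp.name

demo_tweets = list(iter_tweets(demo_path))
os.remove(demo_path)
print(len(demo_tweets), "tweets loaded")
```

Because the function is a generator, each pass re-reads the file from disk instead of holding every tweet in memory, which is consistent with the "read 5000 tweets..." progress lines appearing again for every cell below.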

How many followers do those users have? Let's make a histogram

# EDA: plot the distribution of followers count
import matplotlib.pyplot as plt
plt.style.use('classic')
%matplotlib inline

import seaborn as sns
sns.set()

followers_count = []
for tweet in util.loadjson(src_path):    
    followers_count.append(tweet['user']['followers_count'])


print("followers count mean is ", np.mean(followers_count))
print("followers count std is ", np.std(followers_count))

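# keep only accounts with fewer than 10,000 followers so the long tail
# does not flatten the rest of the histogram
followers_count = [num for num in followers_count if num < 10000]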
plt.hist(followers_count, alpha=0.5, bins=20)
plt.xlabel('Followers count')
plt.ylabel('Number of accounts')  
plt.title('Histogram of followers count')

read 5000 tweets...
read 10000 tweets...
read 15000 tweets...
Finish loading 19821 tweets
followers count mean is  6786.20326926
followers count std is  126908.923233

<matplotlib.text.Text at 0x10e783518>
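
The printed mean (about 6,786) is tiny next to the standard deviation (about 126,909), which points to a heavily right-skewed distribution and explains why the histogram is restricted to accounts below 10,000 followers. One way to see the skew without discarding the tail is to compare the mean with the median and to count accounts per order of magnitude. The sketch below uses made-up follower counts, and `decade_counts` is a hypothetical helper, not part of the tutorial's code.

```python
import math
import statistics
from collections import Counter


def decade_counts(values):
    """Count values per power-of-ten bucket; bucket 0 holds values 0-9."""
    buckets = Counter()
    for value in values:
        # log10 is undefined for 0, so clamp to 1 before taking the log
        buckets[int(math.log10(max(value, 1)))] += 1
    return dict(sorted(buckets.items()))


# Made-up follower counts for illustration only
sample = [0, 3, 12, 85, 140, 950, 2300, 48000, 1200000]

print("mean:", statistics.mean(sample))
print("median:", statistics.median(sample))
print("accounts per decade:", decade_counts(sample))
```

A single very large account pulls the mean far above the median, which is the same pattern the real data shows.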

Which words are tweeted most often? Let's make a word cloud

from utility.wordcloud_maker import generate_cloud
from IPython.display import Image
from IPython.core.display import HTML 

text = []
for tweet in util.loadjson(src_path):    
    text.append(tweet['text'])

generate_cloud(' '.join(text))   #, full_prefix + 'wordcloud') 
# Image(filename = full_prefix + 'wordcloud.png')

read 5000 tweets...
read 10000 tweets...
read 15000 tweets...
Finish loading 19821 tweets
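
A word cloud sizes each word by how often it occurs, so the underlying computation is a frequency count. The preprocessing inside `generate_cloud` is not shown in this tutorial. The sketch below is one possible approach, assuming we want to drop URLs, the "RT" marker, and a small stopword list; these cleaning choices and the name `word_frequencies` are illustrative assumptions.

```python
import re
from collections import Counter

# Small illustrative stopword list; a real analysis would use a fuller one
STOPWORDS = {"rt", "the", "a", "an", "to", "of", "and", "is", "in", "for", "on"}
URL_PATTERN = re.compile(r"https?://\S+")
TOKEN_PATTERN = re.compile(r"[#@]?\w+")


def word_frequencies(texts):
    """Return a Counter of lowercased tokens with URLs and stopwords removed."""
    counts = Counter()
    for text in texts:
        cleaned = URL_PATTERN.sub(" ", text.lower())
        for token in TOKEN_PATTERN.findall(cleaned):
            if token not in STOPWORDS and len(token) > 1:
                counts[token] += 1
    return counts


# Made-up tweets for illustration only
sample_texts = [
    "RT @someuser: The memo is out https://t.co/abc123",
    "The memo and the lawsuit #breaking",
    "Lawsuit filed today https://t.co/xyz789",
]
frequencies = word_frequencies(sample_texts)
print(frequencies.most_common(3))
```

Removing URLs before tokenizing keeps fragments such as "https", "co", and shortened-link IDs out of the counts.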

What is the tweet volume by minute? Let's make a time series

dates = []
for tweet in util.loadjson(src_path):    
    dt = datetime.datetime.strptime(tweet['created_at'], '%a %b %d %H:%M:%S +0000 %Y')  # format the time
    dates.append({'timestamp': pd.Timestamp(dt)})

df = pd.DataFrame(dates)
times = pd.to_datetime(df.timestamp)
df.groupby([times.dt.minute]).count().rename(columns={"timestamp": "count"}).plot()
plt.xlabel('minute')
plt.ylabel('Number of tweets')
plt.title('Tweet time series')

read 5000 tweets...
read 10000 tweets...
read 15000 tweets...
Finish loading 19821 tweets

<matplotlib.text.Text at 0x112058048>
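
Note that `times.dt.minute` is the minute of the hour (0 to 59), so if a collection window crosses an hour boundary, tweets from different hours land in the same bucket. A sketch of an alternative that keeps the date and hour by truncating each timestamp to the full minute is shown below. It uses the same format string as the cell above; the timestamps are made up and `tweets_per_minute` is a hypothetical helper.

```python
import datetime
from collections import Counter

# Same format string as in the cell above
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S +0000 %Y"


def tweets_per_minute(created_at_values):
    """Count tweets per calendar minute, keeping the date and hour."""
    counts = Counter()
    for raw in created_at_values:
        moment = datetime.datetime.strptime(raw, TWITTER_TIME_FORMAT)
        counts[moment.replace(second=0, microsecond=0)] += 1
    return dict(sorted(counts.items()))


# Made-up timestamps that straddle an hour boundary
sample_times = [
    "Fri Apr 20 17:59:10 +0000 2018",
    "Fri Apr 20 17:59:45 +0000 2018",
    "Fri Apr 20 18:00:05 +0000 2018",
    "Fri Apr 20 18:59:30 +0000 2018",
]
per_minute = tweets_per_minute(sample_times)
for minute, count in per_minute.items():
    print(minute.isoformat(), count)
```

Here 17:59 and 18:59 stay in separate buckets, whereas grouping by minute of the hour would merge them. With a pandas DataFrame indexed by timestamp, `resample('1min').size()` expresses the same idea.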

Where are tweets with geo-coordinates? Let's plot them on a map!

from mapboxgl import utils 
from mapboxgl import viz

token = "pk.eyJ1IjoieWFucGVuZ3BlcnJ5IiwiYSI6ImNqZjkzcXU0ODBvaHMyeW9iNjVvcDVvazcifQ.MoKCHwmangU5Re0okuPB_g"

coordinates = []

for tweet in util.loadjson(src_path):    
    if "coordinates" in tweet and tweet["coordinates"] is not None:
        print("find a coordinate")
        coordinates.append(tweet["coordinates"]["coordinates"] + [tweet["user"]["statuses_count"]])


df_coordinates = pd.DataFrame(coordinates, columns=['lon', 'lat', 'statuses_count'])

# Create a geojson file export from a Pandas dataframe
viz_coordinates = utils.df_to_geojson(df_coordinates, properties=['statuses_count'],
              lat='lat', lon='lon', precision=3)

# Create the viz from the dataframe
tweet_on_map = viz.CircleViz(viz_coordinates,
                access_token=token,
                radius = 0,
                center = (46.710368,23.626842),
                zoom = 2,
                stroke_width = 3,
#                 color = 'red',
                stroke_color = 'red',
              )
tweet_on_map.show()

find a coordinate
read 5000 tweets...
read 10000 tweets...
read 15000 tweets...
find a coordinate
Finish loading 19821 tweets
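
The `coordinates` field of a tweet is a GeoJSON point whose array is ordered longitude first, then latitude, which is why the DataFrame columns above are named `lon`, `lat` in that order. Only two of the 19,821 tweets carried coordinates in this run. To make the intermediate format concrete, the sketch below builds a GeoJSON FeatureCollection by hand from made-up tweets. It is not a description of what `utils.df_to_geojson` does internally, and `tweets_to_geojson` is a hypothetical name.

```python
import json


def tweets_to_geojson(tweets):
    """Build a GeoJSON FeatureCollection from tweets that carry coordinates."""
    features = []
    for tweet in tweets:
        point = tweet.get("coordinates")
        if not point:
            continue  # field is missing or None for most tweets
        lon, lat = point["coordinates"]  # GeoJSON order: longitude first
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"statuses_count": tweet["user"]["statuses_count"]},
        })
    return {"type": "FeatureCollection", "features": features}


# Made-up tweets for illustration only
sample_tweets = [
    {"coordinates": None, "user": {"statuses_count": 12}},
    {"coordinates": {"type": "Point", "coordinates": [-73.99, 40.73]},
     "user": {"statuses_count": 3400}},
    {"user": {"statuses_count": 7}},
]
collection = tweets_to_geojson(sample_tweets)
print(json.dumps(collection, indent=2))
```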

Construct tweet-retweet relationship dictionary

# c[original_author][retweeter] = number of times retweeter retweeted original_author
c = defaultdict(lambda: defaultdict(int))

for tweet in util.loadjson(src_path):
    if 'retweeted_status' in tweet:
        c[tweet['retweeted_status']['user']['screen_name']][tweet['user']['screen_name']] += 1
        if tweet['user']['screen_name'] not in c:
            c[tweet['user']['screen_name']] = defaultdict(int)

read 5000 tweets...
read 10000 tweets...
read 15000 tweets...
Finish loading 19821 tweets
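
To make the shape of this dictionary concrete, here is the same loop run on a handful of made-up tweets, followed by a ranking of authors by how often they were retweeted. The screen names are invented and the ranking step is an illustrative addition, not part of the tutorial's pipeline.

```python
from collections import defaultdict

# Made-up tweets that contain only the fields used by the cell above
sample_tweets = [
    {"user": {"screen_name": "alice"},
     "retweeted_status": {"user": {"screen_name": "newsdesk"}}},
    {"user": {"screen_name": "bob"},
     "retweeted_status": {"user": {"screen_name": "newsdesk"}}},
    {"user": {"screen_name": "alice"},
     "retweeted_status": {"user": {"screen_name": "newsdesk"}}},
    {"user": {"screen_name": "carol"}},  # original tweet, ignored
    {"user": {"screen_name": "bob"},
     "retweeted_status": {"user": {"screen_name": "alice"}}},
]

retweets = defaultdict(lambda: defaultdict(int))
for tweet in sample_tweets:
    if "retweeted_status" in tweet:
        author = tweet["retweeted_status"]["user"]["screen_name"]
        retweeter = tweet["user"]["screen_name"]
        retweets[author][retweeter] += 1
        # Register the retweeter as a key too, even if nobody retweeted them
        if retweeter not in retweets:
            retweets[retweeter] = defaultdict(int)

# Rank authors by the total number of retweets they received
totals = {author: sum(by_user.values()) for author, by_user in retweets.items()}
ranking = sorted(totals.items(), key=lambda item: item[1], reverse=True)
print(ranking)
```

Every account that appears in a retweet ends up as a top-level key, so accounts that only retweet others still become nodes when the graph is built.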

Initialize a graph object using social_network.py, which is a wrapper around the NetworkX library. We will add an attribute to each node and show some node and edge properties.

import social_network
TYPE = "directed"
g = social_network.Graph(c, TYPE, weighted = True)
g.build_graph()

# let's add a node attribute
with open(full_prefix + "user_info.json", "r") as f:
    user_info = json.load(f)
node_attr = {}
for node in g.graph.nodes():
    if node in user_info and user_info[node]['verified']:
        print(node)
        node_attr[node] = 'verified'
    else:
        node_attr[node] = 'not_verified'
        
g.set_node_attributes("user_status", node_attr)

begin building dictionary
number of nodes added 11407
finish adding nodes
finish building dictionary
add node attribute user_status

len(g.graph.nodes())
list(g.graph.nodes())[:10]
list(g.get_node_attributes("user_status").items())[:10]

[('PoliticalShort', 'not_verified'),
 ('tadams1234bg', 'not_verified'),
 ('Goss30Goss', 'not_verified'),
 ('ArtzMarshall', 'not_verified'),
 ('mitchellvii', 'not_verified'),
 ('Clh1992Hahn', 'not_verified'),
 ('WayneDupreeShow', 'not_verified'),
 ('Jamikels0', 'not_verified'),
 ('AP', 'not_verified'),
 ('MitchDennison', 'not_verified')]

len(g.graph.edges())
list(g.graph.edges())[:10]

[('tadams1234bg', 'PoliticalShort'),
 ('ArtzMarshall', 'Goss30Goss'),
 ('ArtzMarshall', 'krassenstein'),
 ('ArtzMarshall', 'PalmerReport'),
 ('ArtzMarshall', 'funder'),
 ('ArtzMarshall', 'aroseblush'),
 ('ArtzMarshall', 'Quicks35'),
 ('ArtzMarshall', 'joncoopertweets'),
 ('ArtzMarshall', 'ChrisJZullo'),
 ('ArtzMarshall', 'Dangchick1')]

Generate a .gexf file and keep only the top 60 labels

g.generategexffile(full_prefix)
g.overwrite_default_label(full_prefix, 60)

start generating gexf file
finish generating gexf file
/Users/zc/Documents/twitter_data/processed_data/trump_test2/trump_test2_tweet_.gexf.swp
/Users/zc/Documents/twitter_data/processed_data/trump_test2/trump_test2_tweet_.gexf
file updated!!

I will demo how to load the .gexf file into Gephi, filter the giant component, run a community detection algorithm, and generate a .png file

Image(filename = full_prefix + "community.png")
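
Filtering the giant component means keeping only the largest group of accounts that are linked to each other through retweets. In the tutorial this step is done in the Gephi interface. As a rough pure-Python sketch of the idea, the function below runs a breadth-first search over a nested dictionary shaped like `c`, treating edges as undirected (an assumption about how the filter handles a directed graph). `giant_component` is a hypothetical helper and the data is made up.

```python
from collections import deque


def giant_component(retweets):
    """Return the largest weakly connected set of nodes in a nested edge dict."""
    # Undirected adjacency sets: edge direction is ignored for this filter
    neighbours = {}
    for source, targets in retweets.items():
        neighbours.setdefault(source, set())
        for target in targets:
            neighbours[source].add(target)
            neighbours.setdefault(target, set()).add(source)

    seen = set()
    largest = set()
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search collects every node reachable from start
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for other in neighbours[node]:
                if other not in component:
                    component.add(other)
                    queue.append(other)
        seen |= component
        if len(component) > len(largest):
            largest = component
    return largest


# Made-up retweet dictionary in the same shape as `c`
sample = {
    "newsdesk": {"alice": 2, "bob": 1},
    "alice": {"bob": 1},
    "bob": {},
    "dave": {"erin": 1},
    "erin": {},
    "frank": {},
}
print(sorted(giant_component(sample)))
```

Small disconnected groups and isolated accounts are dropped, which keeps the community detection step focused on the main conversation.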

What topics are discussed in each community? Let's use an unsupervised topic classification algorithm (NMF) to find out.

We first export the Gephi data as a CSV file, then read it into a pandas DataFrame and use scikit-learn to generate topics.

from nmf_topic_classify import runNMF

# identify class numbers from the Gephi interface, and update these two variables
pro_trump_class = 1
anti_trump_class = 0

df = pd.read_csv(full_prefix + 'community.csv')
df.columns = ['Id', 'Label', 'interval', 'userID', 'user_status', 'componentID', 'modularity_class']  # one more column 'user_status' 
print(df.head())

df_pro_trump = df[df.modularity_class == pro_trump_class]
df_anti_trump = df[df.modularity_class == anti_trump_class]
    
user_pro_trump = set(df_pro_trump.Id)
user_anti_trump = set(df_anti_trump.Id)

communities = {
     "user_pro_trump": [user_pro_trump, []],
     "user_anti_trump": [user_anti_trump, []],
    }


for tweet in util.loadjson(src_path):
    for community, value in communities.items():
        user_names = value[0]
        user_tweets = value[1]
        if tweet['user']['screen_name'] in user_names:
            user_tweets.append(tweet['text'].lower())
            
# generate NMF topics
for community, value in communities.items():
    print("generating topic for community %s" %(community))
    user_tweets = value[1]
    n_features = 1000
    if len(user_tweets) < 1000:
        n_features = 100
    runNMF(dataset = user_tweets, n_features = n_features)

               Id Label  interval          userID   user_status  componentID  \
0  PoliticalShort   NaN       NaN  PoliticalShort  not_verified            0   
1    tadams1234bg   NaN       NaN    tadams1234bg  not_verified            0   
2      Goss30Goss   NaN       NaN      Goss30Goss  not_verified            0   
3    ArtzMarshall   NaN       NaN    ArtzMarshall  not_verified            0   
4     mitchellvii   NaN       NaN     mitchellvii  not_verified            0   

   modularity_class  
0                 1  
1                 1  
2                 0  
3                 0  
4                 1  
read 5000 tweets...
read 10000 tweets...
read 15000 tweets...
Finish loading 19821 tweets
generating topic for community user_pro_trump
Number of tweets collected (after preprocessing): 5328
Fitting the NMF model with n_samples=5328 and n_features=1000...
Topic #0:
co https rt primary 2jt1xlcqvp berniewasrobbed dloesch trump qanon card

Topic #1:
campaign russia wikileaks dnc trump democratic election suing 2016 party

Topic #2:
collusion lunatics fringe fairytale become need desperately russian hold dbongino

Topic #3:
comey memos trump marklevinshow indict 1fo6eu1oud rt president dossier investigation

Topic #4:
countersuit weasels realjameswoods wvqsladspd lol wait co https rt fees

Topic #5:
pompeo mike state sec denuclearization instrumental stump_for_trump confirm know democrats

Topic #6:
amp lisamei62 card coming realdonaldtrump trump popcorn sayi grab immediately

Topic #7:
think clinton emails tell_michelle_ logs texts count turn phone retweet

Topic #8:
cnn setup public orchestrate reports enemy helped suggest ltjgszrszd realsaavedra

Topic #9:
fbi director ag law constitution sebgorka nation correct chief according

generating topic for community user_anti_trump
Number of tweets collected (after preprocessing): 8887
Fitting the NMF model with n_samples=8887 and n_features=1000...
Topic #0:
26m novatek finance chairman huge giant lobbying elliott deputy offered

Topic #1:
co https rt trump realdonaldtrump people care thehill giuliani love

Topic #2:
russia campaign wikileaks democratic party lawsuit trump 2016 election breaking

Topic #3:
fun twice lunch sued time teapainusa fact today donald trump

Topic #4:
defamation mccabe wrongful termination sue admin amjoyshow oifhp5biiz thehill trump

Topic #5:
comey memos trump rt leaked congress within keep copies president

Topic #6:
sentence nzrtyqqk8l quite preetbharara co https rt play reminder one

Topic #7:
suit better rat pathologically trapped moron chief late robreiner lying

Topic #8:
john barron posing claimed owned father estate listen maddowblog qczjqh

Topic #9:
cohen maggienyt trump garbage gone treat coxnmre22n way says stone
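
Each "Topic #k" line above lists the highest-weighted terms of one topic. Non-negative matrix factorization approximates a document-term matrix V (tweets by terms) as a product W H, where W (tweets by topics) and H (topics by terms) contain no negative entries, so the largest values in a row of H identify that topic's terms. The internals of `runNMF` are not shown in this tutorial. The sketch below is a toy pure-Python NMF using multiplicative updates on raw word counts, written only to illustrate the factorization; it is not the scikit-learn solver, and the corpus, `toy_nmf`, and the other helper names are made up.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def squared_error(v, w, h):
    """Squared Frobenius distance between V and the product W H."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def toy_nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor V (docs x terms) into non-negative W and H."""
    rng = random.Random(seed)
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in v]
    h = [[rng.random() + 0.1 for _ in v[0]] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h_kj * n / (d + eps) for h_kj, n, d in zip(h_row, n_row, d_row)]
             for h_row, n_row, d_row in zip(h, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w_ik * n / (d + eps) for w_ik, n, d in zip(w_row, n_row, d_row)]
             for w_row, n_row, d_row in zip(w, num, den)]
    return w, h


# Made-up mini corpus; the words echo the topics printed above
docs = [
    "comey memos leaked",
    "comey memos congress",
    "russia campaign lawsuit",
    "russia lawsuit wikileaks",
]
vocab = sorted({word for doc in docs for word in doc.split()})
# Document-term matrix of raw counts (runNMF may weight terms differently)
matrix = [[doc.split().count(word) for word in vocab] for doc in docs]

w, h = toy_nmf(matrix, n_topics=2)
for k, row in enumerate(h):
    top = sorted(zip(row, vocab), reverse=True)[:3]
    print("Topic #%d:" % k, " ".join(word for _, word in top))
```

Tokens such as "co", "https", and "rt" in the real output come from links and retweet markers, so stripping URLs and the "RT" prefix before vectorizing would likely give cleaner topics.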