Politiclimate 2.0

Politiclimate 1.0

Originally, I trained my model on 1.4 million tweets gathered from the 25 most liberal and 25 most conservative urban counties, filtered only by the location of the tweet. After training on these collections of tweets, my models could predict whether a county was conservative or liberal based on a new corpus of gathered tweets.

I discovered some very important issues in my original process:

  • There was far too much noise. Sports teams, local events and countless other non-political tweets created so much irrelevant data that the models did not train or test effectively. I needed to isolate the political tweets
    • Twitter’s streaming API does not allow for both location AND topic filtering
  • The original scrape only captured topics relevant to the period and political climate in which it was collected. For example, there was notable mention of the Alabama election in the training set, but far less in the test set gathered 2 weeks later
  • Twitter’s user demographics skew results heavily towards liberal, since the user base skews younger
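One way around the location-versus-topic limitation is to stream by location only and apply the topic filter locally afterwards. The snippet below is a minimal sketch of that idea, not the project's actual code: the function names, the `text` and `county` fields, and the sample topics are all assumptions made for illustration.

```python
import re


def mentions_topic(text, topics):
    """Return True if any topic appears in the text as a whole word or phrase."""
    lowered = text.lower()
    return any(
        re.search(r"\b" + re.escape(topic.lower()) + r"\b", lowered)
        for topic in topics
    )


def filter_political(tweets, topics):
    """Keep only the location-filtered tweets that mention at least one topic."""
    return [tweet for tweet in tweets if mentions_topic(tweet["text"], topics)]


# Hypothetical tweets that were already collected by location.
sample = [
    {"text": "Big win for the home team tonight!", "county": "A"},
    {"text": "Polls close at 7pm for the Alabama election", "county": "B"},
]
topics = ["election", "tax bill"]  # assumed topic list

political = filter_political(sample, topics)
print([tweet["county"] for tweet in political])
```

Matching on word boundaries keeps a topic such as "election" from firing on unrelated words like "selection", which helps a little with the noise problem described above.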

 

The Original Politiclimate Presentation

 


Politiclimate 2.0

Politiclimate 2.0 is the continuation of my General Assembly Capstone Project. The new goal is to create a website with a map-based GUI showing a live feed of politically charged tweets categorized as “Red” or “Blue” (Conservative or Liberal). Users can explore the hot-button issues discussed on social media in a given time and location and filter by political affiliation (with a margin of error, of course).

Currently, the script can:

  • Run a daily scrape of relevant topical information using politically themed subreddits
  • Extract the most prominent topics using word counts and LDA
  • Gather tweets based on up to 50 locations and 500 filters (topics scraped from political subreddits)
  • Store tweets in a remote MongoDB database
  • Clean the tweets using NLTK and custom scripts
  • Convert emojis into human-readable expressions and extract topics of interest
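Two of the steps above, word-count topic extraction and emoji conversion, can be sketched with the standard library alone. This is only an illustration under stated assumptions: the real script uses NLTK and LDA, whereas here a small hard-coded stopword set stands in for NLTK's list, LDA is omitted, and the `demojize` and `top_terms` helpers are hypothetical names. Emoji detection relies on the Unicode "So" (Symbol, other) category, which is a rough heuristic rather than a complete emoji definition.

```python
import re
import unicodedata
from collections import Counter

# Tiny stand-in for a real stopword list (assumption; NLTK's list is much longer).
STOPWORDS = {"the", "a", "an", "of", "to", "in", "on", "and", "for", "is", "new"}


def demojize(text):
    """Replace symbol characters such as emojis with :unicode_name: tokens."""
    pieces = []
    for ch in text:
        if unicodedata.category(ch) == "So":
            name = unicodedata.name(ch, "")
            pieces.append(f" :{name.lower().replace(' ', '_')}: " if name else " ")
        else:
            pieces.append(ch)
    # Collapse the extra whitespace introduced around the tokens.
    return " ".join("".join(pieces).split())


def top_terms(headlines, n=3):
    """Return the n most frequent non-stopword terms across the headlines."""
    counts = Counter()
    for headline in headlines:
        words = re.findall(r"[a-z']+", headline.lower())
        counts.update(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(n)]


headlines = [
    "Senate passes tax bill",
    "Tax cuts debated in the Senate",
    "New tax plan announced",
]
print(top_terms(headlines, n=2))
print(demojize("Vote 🔥"))
```

The resulting terms could then serve as the topic filters for the tweet-gathering step, while the emoji tokens make reactions available to later text processing as ordinary words.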

The next steps are:

  • Implement a training and testing pipeline for political leaning by developing a targeted sentiment analysis model on topics from both conservative and liberal subreddits
  • Create a Graphical User Interface
    • Live feed of “Red” and “Blue” tweets populating a map
    • Statistics, charts and tables that summarize the data on Politiclimate.com (using React and possibly Django)
      • For example, explore the topical analysis of tweets in Alabama during the Moore/Jones election
  • Automate the process efficiently in 24-hour cycles:
    • Scrape new Liberal and Conservative topics (using LDA on headlines from political subreddits)
    • Run sentiment analysis of topics
    • Train new model with added sentiment analysis and topics
    • Scrape tweets using topics as filters
    • Display trickle feed of filtered tweets using pins on the politiclimate map (eye candy)

More to come soon 🙂