Scraping, Analyzing, and Visualizing Harry Potter Fan Fiction

Me and my friends playing Quidditch, 2009. Please don’t zoom in.
Photo taken by me in my grandparents’ bedroom, 2011, made ominous by the Photoshop skills of my cousin
  • seven interactive graphs to play with
  • a walkthrough of data gathering, cleaning, manipulating, and visualizing
  • some tips for learning d3
  • some nostalgia for Harry Potter
  • an appreciation of fan fiction
  • a warm, magical feeling

Scraping

  • Weekend spent building and testing the scraper on one or two pages of FanFiction.net
  • Two full days of letting the scraper run, then coming back from work to find some anomalous cases where the HTML on FanFiction.net wasn’t formatted as expected. FanFiction.net’s terms of service ask that you not load its servers more heavily than a human could, so I had the scraper fetch one index page (25 stories) every five seconds before moving on to the next page.
  • One full day of letting the scraper run, this time backwards. This way I didn’t have to check for when the last page was reached (the last page would now always be page 1), and I wouldn’t miss stories published while my scraper was running. Came back from work to find that one story did not have an author ID, so I added a catch statement for that and ran the scraper again, lowering the sleep time to 2 seconds.
  • Two full days of letting the scraper run, resulting in the desired JSON file!
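The crawl loop described above can be sketched roughly as follows. This is a minimal sketch, not the actual scraper: `fetch_page` and `parse_page` are hypothetical callables standing in for the HTTP request and the HTML parsing, and the `author_id` key is an assumed name. It walks the index pages from the highest page number down to page 1, pauses between requests, and tolerates a story with no author ID.

```python
import json
import time


def scrape_backwards(fetch_page, parse_page, start_page, delay_seconds=2.0,
                     sleep=time.sleep):
    """Walk index pages from start_page down to 1 and collect story records.

    fetch_page(page_number) and parse_page(html) -> list of dicts are
    hypothetical callables supplied by the caller; they are assumptions of
    this sketch, not part of any real library.
    """
    stories = []
    for page_number in range(start_page, 0, -1):
        html = fetch_page(page_number)
        for raw in parse_page(html):
            try:
                author_id = raw["author_id"]
            except KeyError:
                # Some stories lack an author ID; keep the story anyway.
                author_id = None
            stories.append({**raw, "author_id": author_id})
        if page_number > 1:
            # Pause between requests so the crawl stays gentle on the server.
            sleep(delay_seconds)
    return stories


def save_stories(stories, path):
    """Write the collected records to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(stories, handle, ensure_ascii=False, indent=2)
```

Passing `sleep` in as a parameter keeps the loop easy to try out without actually waiting between pages.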
Sample index page on FanFiction.net
  • Title
  • Author
  • Rating (K, T, M, etc. for how mature of an audience the fic is meant for)
  • Language
  • Genre
  • Number of words
  • Number of reviews
  • Number of favorites
  • Number of follows
  • Last updated timestamp
  • Publication timestamp
  • Character tags
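One way to hold these fields in code is a small record type. The sketch below is an assumption about shape, not the schema the scraper actually produced: the field names, the optional author ID, and the use of Unix timestamps are all illustrative.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class Story:
    """One story from an index page (hypothetical field names)."""
    title: str
    author: str
    author_id: Optional[int]  # may be missing for a few stories
    rating: str               # e.g. "K", "T", "M"
    language: str
    genre: str
    words: int
    reviews: int = 0
    favorites: int = 0
    follows: int = 0
    updated: Optional[int] = None    # Unix timestamp (assumed format)
    published: Optional[int] = None  # Unix timestamp (assumed format)
    characters: List[str] = field(default_factory=list)


def stories_to_json(stories):
    """Serialize a list of Story records to a JSON string."""
    return json.dumps([asdict(story) for story in stories], ensure_ascii=False)
```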

Cleaning and Manipulating

  • CSV file of number of fan fictions published about a given character on a given day
  • Counts of how many times characters were tagged in fan fictions
  • Counts of words used in titles
  • Co-occurrence matrix for how often two characters co-occurred
  • JSON of links and nodes between popular characters
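Several of these outputs come down to simple counting. The sketch below uses only the standard library and assumes each story is a dict with hypothetical keys `title`, `published` (a date string), and `characters`; the real pipeline and key names may well differ.

```python
import csv
import re
from collections import Counter


def character_counts(stories):
    """Count how many stories tag each character."""
    return Counter(name for story in stories for name in set(story["characters"]))


def title_word_counts(stories):
    """Count lowercase words across all titles."""
    words = Counter()
    for story in stories:
        words.update(re.findall(r"[a-z']+", story["title"].lower()))
    return words


def daily_character_counts(stories):
    """Count stories per (publication date, character) pair."""
    counts = Counter()
    for story in stories:
        for name in set(story["characters"]):
            counts[(story["published"], name)] += 1
    return counts


def write_daily_csv(counts, path):
    """Write the per-day counts as rows of date, character, count."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["date", "character", "count"])
        for (date, name), count in sorted(counts.items()):
            writer.writerow([date, name, count])
```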
pandas dataframe of a co-occurrence matrix
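A co-occurrence matrix like this can be built from the character tags of each story. The following is a minimal sketch, assuming the tags arrive as one list of names per story; `cooccurrence_matrix`, `to_graph`, and the `nodes`/`links` layout are illustrative names chosen here, not taken from the original code.

```python
from itertools import combinations

import pandas as pd


def cooccurrence_matrix(tag_lists):
    """Build a symmetric DataFrame counting how often two characters share a story."""
    names = sorted({name for tags in tag_lists for name in tags})
    matrix = pd.DataFrame(0, index=names, columns=names)
    for tags in tag_lists:
        # Each unordered pair in a story counts once, in both directions.
        for a, b in combinations(sorted(set(tags)), 2):
            matrix.loc[a, b] += 1
            matrix.loc[b, a] += 1
    return matrix


def to_graph(matrix, min_count=1):
    """Turn the matrix into a nodes/links dict suitable for dumping to JSON."""
    nodes = [{"id": name} for name in matrix.index]
    links = [
        {"source": a, "target": b, "value": int(matrix.loc[a, b])}
        for a, b in combinations(matrix.index, 2)
        if matrix.loc[a, b] >= min_count
    ]
    return {"nodes": nodes, "links": links}
```

Raising `min_count` is one way to keep only links between characters that appear together often, which helps when the graph should cover just the popular characters.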

Visualizing

  • Read Scott Murray’s d3 tutorial. That guy is hilarious, and everything was very clearly explained. I had a good amount of HTML/CSS/JS background, but I think it works well even for those with minimal background. Also make sure to read his talk on transitions.
  • Try making some bar graphs or scatterplots of data you’re really interested in. Scott covers bar graphs and it’s very rewarding to see graphs of data you care about.
  • Make some silly transitions. There’s no real reason the first graph I made lets you change the colors to match the different houses (red/gold for Gryffindor, blue/bronze for Ravenclaw), except that I wanted to try it, and it will get you familiar with the enter/update/exit paradigm.
  • Read and very slowly walk through some blocks, especially those by d3’s creator himself.
First attempt at a scatterplot: slow load times, slower zoom/transition times, hard to see events
Number of fan fictions about the character Harry Potter from 1999–2017, with the mouse hovering over the movie reel icon that marks the release of the movie HP7 part 2.
Zoom in on the spike in 2016 when filtering for fics tagged with Snape
Gender distribution of Top 50 most mentioned Harry Potter characters in JK Rowling’s books
Gender distribution of Top 50 most tagged characters in fan fiction, with the mouse hovering over Lily Potter
Word cloud of titles of fan fiction tagged as being about Ron and Hermione, with the mouse hovering over ‘Weasley’
Word cloud of titles of fan fiction tagged as being about Sirius and Lupin, with the mouse hovering over ‘Chocolate’
Co-occurrence matrix, with the mouse hovering over Harry x Draco
Character relationship graph, with the mouse hovering over Harry

The Power of Fan Fiction


software engineer | writer

Allison King
