Scraping, Analyzing, and Visualizing Harry Potter Fan Fiction

Me and my friends playing Quidditch, 2009. Please don’t zoom in.
Photo taken by me in my grandparents’ bedroom, 2011, made ominous by the Photoshop skills of my cousin
  • seven interactive graphs to play with
  • a walkthrough of data gathering, cleaning, manipulating, and visualizing
  • some tips for learning d3
  • some nostalgia for Harry Potter
  • an appreciation of fan fiction
  • a warm, magical feeling

Scraping

  • Weekend spent building and testing the scraper on one or two pages of FanFiction.net
  • Two full days of letting the scraper run, then coming back from work and finding some anomalous cases where the HTML on FanFiction.net wasn’t formatted as expected. FanFiction.net’s terms of service ask that you not load its servers more heavily than a human could, so I had the scraper fetch one index page (25 stories) every five seconds before moving on to the next page.
  • One full day of letting the scraper run, this time backwards. This way I didn’t have to check for when the last page was reached (the last page would now always be page 1), and I wouldn’t miss stories published while my scraper was running. Came back from work to find that one story did not have an author ID, so I added a catch statement for that, then ran the scraper again, lowering the sleep time to 2 seconds.
  • Two full days of letting the scraper run, resulting in the desired JSON file!
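The loop described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the scraper actually used: `fetch_page`, `parse_story`, `scrape_backwards`, and the `author_id` key are hypothetical names, and the real HTTP request and HTML parsing are replaced by a stand-in function, since FanFiction.net's page structure isn't described here. What it does show is walking the index pages from the last one down to page 1, sleeping between pages, and tolerating a story with no author ID.

```python
import json
import time
from typing import Callable, Dict, List, Optional

RawStory = Dict[str, str]


def parse_story(raw: RawStory) -> Dict[str, Optional[str]]:
    """Keep the fields we want; tolerate a story with no author ID."""
    try:
        author_id: Optional[str] = raw["author_id"]
    except KeyError:
        author_id = None  # the anomaly case: record it instead of crashing
    return {"title": raw.get("title"), "author_id": author_id}


def scrape_backwards(
    fetch_page: Callable[[int], List[RawStory]],
    last_page: int,
    delay_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict[str, Optional[str]]]:
    """Visit index pages from last_page down to 1, pausing between pages."""
    stories: List[Dict[str, Optional[str]]] = []
    for page in range(last_page, 0, -1):
        stories.extend(parse_story(raw) for raw in fetch_page(page))
        if page > 1:
            sleep(delay_seconds)  # pause before requesting the next page
    return stories


if __name__ == "__main__":
    # Stand-in for real HTTP + HTML parsing: two fake index pages.
    fake_site = {
        1: [{"title": "Newest story", "author_id": "42"}],
        2: [{"title": "Older story"}],  # no author ID on this one
    }
    result = scrape_backwards(
        lambda page: fake_site[page], last_page=2, delay_seconds=0
    )
    print(json.dumps(result, indent=2))
```

Passing `sleep` in as a parameter is just a convenience for trying the loop out without waiting; a real run would leave it at `time.sleep` and keep the delay at a few seconds.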
Sample index page on FanFiction.net
  • Title
  • Author
  • Rating (K, T, M, etc. for how mature of an audience the fic is meant for)
  • Language
  • Genre
  • Number of words
  • Number of reviews
  • Number of favorites
  • Number of follows
  • Last updated timestamp
  • Publication timestamp
  • Character tags

Cleaning and Manipulating

  • CSV file of number of fan fictions published about a given character on a given day
  • Counts of how many times characters were tagged in fan fictions
  • Counts of words used in titles
  • Co-occurrence matrix for how often two characters co-occurred
  • JSON of links and nodes between popular characters
pandas dataframe of a co-occurrence matrix
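As a rough sketch of how some of the outputs listed above could be derived from the scraped JSON, here is a small example using only the standard library rather than pandas. The record layout (a `title` string and a `characters` list per story) and the function names are assumptions made for illustration, as is the `nodes`/`links` shape, which follows a common convention for d3 graph data rather than anything confirmed by the post.

```python
import re
from collections import Counter
from itertools import combinations
from typing import Dict, List


def count_tags(stories: List[dict]) -> Counter:
    """How many stories each character is tagged in."""
    counts: Counter = Counter()
    for story in stories:
        counts.update(set(story["characters"]))
    return counts


def count_title_words(stories: List[dict]) -> Counter:
    """Lower-cased word counts across all titles."""
    counts: Counter = Counter()
    for story in stories:
        counts.update(re.findall(r"[a-z']+", story["title"].lower()))
    return counts


def cooccurrence(stories: List[dict]) -> Counter:
    """Count stories in which each unordered pair of characters appears."""
    pairs: Counter = Counter()
    for story in stories:
        # Sorting makes (a, b) and (b, a) land on the same key.
        for a, b in combinations(sorted(set(story["characters"])), 2):
            pairs[(a, b)] += 1
    return pairs


def to_graph(tag_counts: Counter, pair_counts: Counter, top_n: int) -> Dict[str, list]:
    """Nodes and links restricted to the top_n most-tagged characters."""
    top = [name for name, _ in tag_counts.most_common(top_n)]
    keep = set(top)
    nodes = [{"id": name, "count": tag_counts[name]} for name in top]
    links = [
        {"source": a, "target": b, "value": n}
        for (a, b), n in sorted(pair_counts.items())
        if a in keep and b in keep
    ]
    return {"nodes": nodes, "links": links}


if __name__ == "__main__":
    sample = [
        {"title": "The Weasley Way", "characters": ["Ron", "Hermione"]},
        {"title": "Weasley Winter", "characters": ["Harry", "Ron", "Hermione"]},
        {"title": "Rivals", "characters": ["Harry", "Draco"]},
    ]
    tags = count_tags(sample)
    pairs = cooccurrence(sample)
    print(tags.most_common(3))
    print(count_title_words(sample).most_common(1))
    print(to_graph(tags, pairs, top_n=3))
```

The pair counts can be loaded into a pandas dataframe afterwards if a square matrix is more convenient for plotting.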

Visualizing

  • Read Scott Murray’s d3 tutorial. That guy is hilarious, and everything is very clearly explained. I had a good amount of HTML/CSS/JS background, but I think it’s pretty good even for those with minimal background. Also make sure to read his talk on transitions.
  • Try making some bar graphs or scatterplots of data you’re really interested in. Scott covers bar graphs and it’s very rewarding to see graphs of data you care about.
  • Make some silly transitions. There’s no reason the first graph I made needed to let you change the color to match the different houses (red/gold for Gryffindor, blue/bronze for Ravenclaw), except that I wanted to try it, and it’ll get you familiar with the enter/update/exit paradigm.
  • Read and very slowly walk through some blocks, especially those by d3’s creator himself.
First attempt at a scatterplot: slow load times, slower zoom/transition times, hard to see events
Number of fan fictions about the character Harry Potter from 1999–2017. Mouse hovering over the movie reel icon marking when the movie HP7 part 2 was released.
Zoom in on the spike in 2016 when filtering for fics tagged with Snape
Gender distribution of the top 50 most mentioned Harry Potter characters in JK Rowling’s books
Gender distribution of the top 50 most tagged characters in fan fiction, with the mouse hovering over Lily Potter
Word cloud of titles of fan fiction tagged as being about Ron and Hermione, with the mouse hovering over ‘Weasley’
Word cloud of titles of fan fiction tagged as being about Sirius and Lupin, with the mouse hovering over ‘Chocolate’
Co-occurrence matrix, with the mouse hovering over Harry x Draco
Character relationship graph, with the mouse hovering over Harry

The Power of Fan Fiction


software engineer | writer

Allison King
