Scraping, Analyzing, and Visualizing Harry Potter Fan Fiction

When I was a kid, like many other kids out there, I really loved Harry Potter. Big deal, you’re probably thinking, so did half the world! I don’t think my story is unique- I grew up waving a chopstick around as my wand, yelling Expecto Patronum out the window at passing cars, dressing up with my friends as each of Voldemort’s horcruxes for midnight movie screenings, and building Quidditch hoops out of PVC pipes and hula hoops in the basement. I can’t quite explain the extent to which I loved the series without going into fan fiction though, which I both read and wrote.

Me and my friends playing Quidditch, 2009. Please don’t zoom in.

It’s been a while since my fan fiction days, and it’s been a while since the last Harry Potter book came out. Since then, I think I’ve matured a bit. I went to college, I majored in engineering, someone willingly hired me, and now I code pretty often. I’ve also started reading arguably more sophisticated things, like The New York Times, where I admired the visualizations from afar. But then I thought, hey, I code too- maybe I can make visualizations like that.

So I set out to learn d3.js, a JavaScript library I had heard rumors and legends about and which powers many of the NYT’s visualizations. I also needed a topic- something that would be interesting to graph and whose data I would not mind staring at for days to come. Then I heard my Harry Potter paraphernalia call out to me: a string of Harry Potter puppets an ex had made for me years ago pleaded with me.

Photo taken by me in my grandparents’ bedroom, 2011, made ominous by the photoshop skills of my cousin

In a month, I ended up learning how to build a web scraper, manipulate the data, and use d3.js to make seven graphs aimed at showing the importance of fan fiction.

To just see all of the graphs and my analysis right away, take a look at my writeup here. To see just the graphs and their code, check out my blocks. To see all of the code including scraping, manipulating, and visualizing, visit the GitHub repository. In this article, I’ll cover the process by which I went about all of that, and hope to leave you with:

Scraping

Even though my primary goal was to learn d3, I needed to start with some data (d3 stands for Data Driven Documents, after all!). The site that I frequented as a teenager was FanFiction.net, so that seemed like a natural place to start. I had never built a web scraper before, but I was able to get one up and running pretty quickly just by following some BeautifulSoup tutorials. I wrote this in Python, a language I had a beginner-to-moderate background in. All in all, it took me about a week to fully scrape the metadata of 560,000 Harry Potter fan fictions and store it in one 184MB JSON file. That week was approximately broken down like this:

That JSON file contained metadata about all of the stories. This is what an index page of Harry Potter fan fiction looks like on FanFiction.net:

Sample index page on FanFiction.net

I saved off all of the data available in these blurbs. Ultimately, the data I had was:

What surprised and excited me most was how active the Harry Potter fan fiction community still is. Even as I was testing the scraper, I found that the data was changing- people were updating or publishing new stories every hour or so!
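The blurb-parsing step of a scraper like this can be sketched in a few lines. This is only an illustration, not the scraper from the repository: the original used BeautifulSoup, while this sketch swaps in Python's built-in `html.parser` so it runs without extra installs. The HTML snippet, the `story`, `title`, and `meta` class names, and the titles are all made up, and the real FanFiction.net markup will differ.

```python
from html.parser import HTMLParser

# Hypothetical markup and titles; the real index pages are structured differently.
SAMPLE_PAGE = """
<div class="story">
  <a class="title" href="/s/1/1/">The Chopstick Wand</a>
  <div class="meta">Rated: K - English - Chapters: 3 - Words: 4,120</div>
</div>
<div class="story">
  <a class="title" href="/s/2/1/">Hoops in the Basement</a>
  <div class="meta">Rated: T - English - Chapters: 1 - Words: 980</div>
</div>
"""


class BlurbParser(HTMLParser):
    """Collects the title and metadata text of each story blurb."""

    def __init__(self):
        super().__init__()
        self.stories = []
        self._field = None  # which field the current text belongs to

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "div" and css_class == "story":
            self.stories.append({"title": "", "meta": ""})
        elif css_class in ("title", "meta") and self.stories:
            self._field = css_class

    def handle_endtag(self, tag):
        self._field = None

    def handle_data(self, data):
        if self._field and self.stories:
            self.stories[-1][self._field] += data.strip()


def parse_meta(meta):
    """Turn 'Rated: K - English - Words: 4,120' into a dict of fields."""
    fields = {}
    for part in meta.split(" - "):
        if ": " in part:
            key, value = part.split(": ", 1)
            fields[key.lower()] = value
    return fields


def scrape_page(html):
    """Return one record per story blurb found in the page."""
    parser = BlurbParser()
    parser.feed(html)
    records = []
    for story in parser.stories:
        fields = parse_meta(story["meta"])
        records.append({
            "title": story["title"],
            "rated": fields.get("rated"),
            "words": int(fields.get("words", "0").replace(",", "")),
        })
    return records


records = scrape_page(SAMPLE_PAGE)
```

A list of records like this can be written out with `json.dump`, which is roughly how a single large JSON file of metadata would come together.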

Cleaning and Manipulating

Now that I had all of my data in a rather large JSON file, I realized there was another step I needed to take before I could plunge into visualization. If I wanted people to view my graphs and have a reasonable experience, their client could not be expected to download an almost 200MB JSON file every time, then wait for my JavaScript to parse out only what it needed for each graph. I set out to use Python and the pandas library to filter out just the necessary data for each of my graphs. Then, of course, I needed to know what each of my graphs would be about. There was a lot of back and forth on this- I would manipulate the data, visualize it, realize the resultant graph wasn’t useful, go back and manipulate the data, visualize it again, and repeat. In the end, the data files I turned my large JSON file into were something like this:

pandas dataframe of a co-occurrence matrix
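To make the co-occurrence idea concrete, here is a rough sketch of counting how often two characters are tagged on the same story. The original work used pandas; this version sticks to the standard library so it runs anywhere. The records and the `characters` field name are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

# Made-up records; only the "characters" field matters here.
stories = [
    {"title": "A", "characters": ["Harry P.", "Draco M."]},
    {"title": "B", "characters": ["Harry P.", "Hermione G.", "Ron W."]},
    {"title": "C", "characters": ["Draco M.", "Harry P."]},
    {"title": "D", "characters": ["Sirius B."]},
]


def cooccurrence_counts(stories):
    """Count how many stories tag each unordered pair of characters."""
    counts = Counter()
    for story in stories:
        # sorted(set(...)) gives each pair one canonical order per story
        for pair in combinations(sorted(set(story["characters"])), 2):
            counts[pair] += 1
    return counts


def to_matrix(counts, names):
    """Lay the pair counts out as a symmetric matrix in the order of `names`."""
    index = {name: i for i, name in enumerate(names)}
    matrix = [[0] * len(names) for _ in names]
    for (a, b), n in counts.items():
        if a in index and b in index:
            matrix[index[a]][index[b]] = n
            matrix[index[b]][index[a]] = n
    return matrix


counts = cooccurrence_counts(stories)
names = ["Harry P.", "Draco M.", "Hermione G.", "Ron W."]
matrix = to_matrix(counts, names)
```

Keeping only the matrix (plus the list of names) is one way to ship a small file to the browser instead of the full metadata.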

Visualizing

Finally, visualizing! I looked up ways to learn d3 and ultimately what I found worked best for me was:

The first visualization I made was a scatterplot of the number of fan fictions published about a character on a given day. The first fan fiction that is still on the site was published in 1999. I wanted to see what events made people write fan fiction, and to do that I needed to collect significant events in history related to Harry Potter, such as when each book was published, when each movie was released, when JK Rowling made announcements, etc. I started off trying to load all of the publication dates for every character, but this was still too large a file and made the user wait a while.

First attempt at a scatterplot- slow load times, slower zoom/transition times, hard to see events

I ended up going from publication day to publication week, then filtering for characters I thought would be interesting. I also moved the lightning bolt events, which were originally positioned at the number of fics written on that day, higher up to make them more visible. Originally I drew an SVG lightning bolt shape myself. Then I decided to change the book publication events to a book icon and the movie release events to a movie reel icon, both from the FontAwesome library- which also had a lightning bolt shape!
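The switch from daily to weekly counts boils down to snapping each publication date to the start of its week and counting. A minimal sketch, assuming the dates are already parsed into `datetime.date` objects and that weeks start on Monday (both are assumptions, and the dates below are invented):

```python
from collections import Counter
from datetime import date, timedelta


def week_start(day):
    """Return the Monday of the week containing `day`."""
    return day - timedelta(days=day.weekday())


def weekly_counts(publish_dates):
    """Count stories per week, returned as (week_start, count) pairs in date order."""
    counts = Counter(week_start(d) for d in publish_dates)
    return sorted(counts.items())


# Invented publication dates for a single character
publish_dates = [
    date(2016, 1, 11),
    date(2016, 1, 14),
    date(2016, 1, 17),
    date(2016, 1, 18),
]
weekly = weekly_counts(publish_dates)
```

Bucketing by week shrinks the number of points to plot by up to a factor of seven, which helps with both the file size and the zoom and transition times.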

This graph in the end looked like this:

Number of fan fictions about the character Harry Potter from 1999–2017. Mouse hovering over the movie reel icon marking the release of the movie HP7 Part 2.

This graph allows the user to hover over events and see if they correspond to spikes. Most of these spikes turned out to be rather predictable, with spikes occurring after each movie/book release date. If we switch to different characters, we often see different spikes- for instance, Sirius has one after the 5th book, and Snape, curiously enough, has one after the actor who played him in the movies, Alan Rickman, passed away.

Zoom in on the spike in 2016 when filtering for fics tagged with Snape

We begin to see now a blurred line between fiction and reality- it is understandable that Sirius fan fiction spiked after Sirius the character died in the books, but for Snape fan fiction to rise after the real life actor passed away is something different. Perhaps it shows the power of Alan Rickman’s acting, to thoroughly blur the lines between himself and the character he played- or maybe it shows the potential healing behind fan fiction- the ability to turn to a fandom that is also grieving and turn that grief into art.

I like to think that my subsequent graphs were a bit neater. Here’s one of the gender distribution in the top 50 most mentioned characters in the original seven Harry Potter books:

Gender distribution of Top 50 most mentioned Harry Potter characters in JK Rowling’s books

There’s a transition with object constancy that lets you see the characters move either up or down in popularity when we look at how often they are tagged in fan fiction instead. Here’s the end product of that transition:

Gender distribution of Top 50 most tagged characters in fan fiction- mouse hover over Lily Potter

Hovering over a bar lets us see how that character changed from the last graph to this one. In this case, Lily Potter went from not being in the 50 most mentioned characters to being the 7th most written about character in HP fan fiction. We also see an obvious increase in the number of female characters (denoted by gold bars) when we go from canon to fan fiction. It is likely that most people who read/write fan fiction identify as female, so maybe this is a case of just writing what you know, but maybe we are also seeing fan fiction fill a gap in canon. Harry Potter, despite some strong female characters, is primarily a male driven story. We see here that fan fiction has the power to showcase voices that aren’t heard as much in the mainstream.
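The rank change shown on hover can be computed by comparing the two orderings. This is a small sketch under assumed inputs: two lists of names, each sorted from most to least popular, with placeholder names standing in for real characters.

```python
def rank_changes(canon_order, fanfic_order):
    """For each character in the fan fiction ranking, report the new rank,
    the previous (canon) rank, and how many places they moved up.
    `previous` and `moved` are None when the character was not in the canon list."""
    canon_rank = {name: i for i, name in enumerate(canon_order, start=1)}
    changes = {}
    for rank, name in enumerate(fanfic_order, start=1):
        previous = canon_rank.get(name)
        changes[name] = {
            "rank": rank,
            "previous": previous,
            "moved": None if previous is None else previous - rank,
        }
    return changes


# Placeholder orderings, not real results
canon_order = ["Character A", "Character B", "Character C"]
fanfic_order = ["Character A", "Character C", "Character D", "Character B"]
changes = rank_changes(canon_order, fanfic_order)
```

A positive `moved` means the character climbed in popularity, a negative one means they fell, and `None` marks a newcomer to the list.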

Which leads me to my next graph about slash fiction. Slash fiction, or fiction that pairs together two characters of the same gender, has a bit of a reputation in the fan fiction community. Gay characters, like female characters, are also underrepresented in mainstream fiction, but very popular in fan fiction. With only the metadata and not much natural language processing background, I decided to make a word cloud of titles of fan fictions for both a straight, canon relationship and a gay, fan favorite relationship.

Word cloud of titles of fan fiction tagged as being about Ron and Hermione- mouse hovering over ‘Weasley’
Word cloud of titles of fan fiction tagged as being about Sirius and Lupin- mouse hovering over ‘Chocolate’

In this graph, by hovering over a word, you can see three randomly chosen titles that have that word in them. If you compare these graphs, you’ll see that they look very similar. There are large words for each of the character names, as well as for their friends (‘Harry’ and ‘Potter’ for Romione, and ‘Marauders’ for Wolfstar), and for ‘Christmas’ (ABC might have something to do with that). There are differences, of course- Wolfstar fan fiction seems to often have ‘Chocolate’ and ‘Moon’, a reference to Lupin having offered Harry chocolate once and to his being a werewolf, but the largest word in both pictures is ‘Love’. After all, love is love, whether you are the brightest witch of your age or were sent to prison for thirteen years and can turn into a dog.
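The data behind a word cloud like this is just a word count over titles, plus a few example titles per word for the hover text. Here is a rough sketch of how that could be prepared. The stopword list, the tokenizing regex, and the sample titles are all assumptions chosen for illustration.

```python
import random
import re
from collections import Counter, defaultdict

# A deliberately tiny stopword list; a real one would be longer.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "to", "for"}


def title_words(title):
    """Lowercase a title and split it into words, dropping stopwords."""
    return [w for w in re.findall(r"[a-z']+", title.lower()) if w not in STOPWORDS]


def word_cloud_data(titles, samples=3, seed=0):
    """Return one entry per word: how many titles use it and a few example titles."""
    counts = Counter()
    titles_by_word = defaultdict(list)
    for title in titles:
        # set() so a word repeated within one title is counted once
        for word in set(title_words(title)):
            counts[word] += 1
            titles_by_word[word].append(title)
    rng = random.Random(seed)  # seeded so the examples are repeatable
    return [
        {
            "word": word,
            "count": n,
            "examples": rng.sample(titles_by_word[word], min(samples, n)),
        }
        for word, n in counts.most_common()
    ]


# Invented titles
TITLES = [
    "Love and Chocolate",
    "Chocolate Moon",
    "The Moon in Love",
    "Love",
    "A Love Letter",
]
cloud = word_cloud_data(TITLES)
```

The `count` can drive the font size in the cloud, while `examples` supplies the three titles shown on hover.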

Cooccurrence matrix- mouse hovering over Harry x Draco

I made this cooccurrence matrix to visualize how characters are related to each other in fan fiction. Darker cells indicate more cooccurrences. The matrix is limited in what it can show us though. Although darker tiles seem to indicate deeper relationships, a closer examination shows that the darker tiles tend to belong to more minor characters who were not developed in the canon as much and so are mostly attached to one character (e.g. Astoria Greengrass to Draco Malfoy, James and Lily, Rose Weasley and Scorpius Malfoy, if we include Cursed Child). Interestingly, this seems to also apply to Ron and Hermione, though both are considered very main characters. Ron’s popularity falls in the prior bar chart as well- perhaps the fandom sees him as a relatively peripheral character?

The last graph is an actual graph, with nodes and links and everything!

Character relationship graph- mouse over Harry

This graph shows the top 15 most popular characters and their 5 top cooccurrences. Some characters are more correlated than others- for instance Harry is very correlated, whereas Charlie Weasley (CW on the left) is only in the graph since he is one of Nymphadora Tonks’ top five relationships. The graph naturally falls into cliques- Marauder generation in the top left corner, next generation in the bottom right corner, Harry’s generation in the middle.
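One way to build the nodes and links for a graph like this is to take the most tagged characters, look up each one's strongest co-occurrences, and let those partners join the graph even if they are not in the top list themselves (which is how a character like Charlie Weasley can appear). The sketch below assumes tag counts in a `Counter` and pair counts keyed by name pairs. The function name, the output shape, and the sample numbers are illustrative, not taken from the original code.

```python
from collections import Counter


def build_graph(tag_counts, pair_counts, top_n=15, top_k=5):
    """Build node and link lists from the top_n characters and their top_k partners."""
    top = [name for name, _ in tag_counts.most_common(top_n)]
    nodes = set(top)
    links = {}
    for name in top:
        # Gather every character that co-occurs with this one
        partners = Counter()
        for (a, b), n in pair_counts.items():
            if a == name:
                partners[b] += n
            elif b == name:
                partners[a] += n
        for partner, n in partners.most_common(top_k):
            nodes.add(partner)  # partners join the graph even if outside the top_n
            links[tuple(sorted((name, partner)))] = n  # one link per unordered pair
    return {
        "nodes": [{"id": name} for name in sorted(nodes)],
        "links": [
            {"source": a, "target": b, "value": v}
            for (a, b), v in sorted(links.items())
        ],
    }


# Invented counts, shrunk to top_n=2 and top_k=1 to keep the example readable
tag_counts = Counter({"Harry": 10, "Tonks": 5, "Ron": 4, "Charlie": 1})
pair_counts = {("Harry", "Ron"): 3, ("Charlie", "Tonks"): 2, ("Harry", "Tonks"): 1}
graph = build_graph(tag_counts, pair_counts, top_n=2, top_k=1)
```

A dictionary with `nodes` and `links` lists is a common input shape for force-directed layouts, so it can be dumped to JSON and loaded on the visualization side.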

The grey circle near the middle is particularly interesting- after Harry and Hermione, the male and female leads of the Harry Potter series, Original Character is the most interconnected character. Fans have created their own characters to intermingle with seemingly every corner of the fandom, from the Marauder era, to next generation, to Tom Riddle Jr. By writing an original character, an author has engaged in an exercise of empathy- in the case of self insert fiction, they have wondered how they would handle situations in the Harry Potter universe, and in the case of a character separate from the author, they have wondered how this type of person would fit in. It is through original characters, as well as through breathing life into minor characters in the canon, that a reader can have a conversation with the author.

The Power of Fan Fiction

Overall, I embarked on this journey to learn how to make neat visualizations. I certainly learned something! There’s still a ton to learn about d3 as well as what makes a good visualization though. I suppose a good visualization lets you explore and discover and teaches something at the same time. I wasn’t sure what I would discover or learn from the resultant graphs, but I guess I’d summarize my findings by saying that fan fiction fills a space that is normally left empty.

What happens when an author publishes a story? The story goes out to readers who are often changed by these stories. They might reach out to the author and let them know, but other than that, the effects of a story are often lost. In Harry Potter, we grew up with a story that taught us themes of tolerance and love that followed JK Rowling’s beliefs. Unique to writing novels is the individualism of it- you don’t see blockbuster movies created by a single person and it’s not often that you see a song written, composed, and sung all by the same artist.

While JK Rowling was able to share some themes, there was no space for discussion about how these topics were presented, whether with fans or with co-creators. So while Harry Potter has some strong female characters, some people of color, and one gay character revealed after the fact, JK Rowling is ultimately only one person, with the experiences of a straight, white woman. In the spirit of seeing an already loved story become better and more representative, fans insert their own experiences and in that way, perhaps we achieve a story better than the original, or at least one more representative of humanity. And that, to me, is pretty magical.
