Scraping, Analyzing, and Visualizing Harry Potter Fan Fiction
When I was a kid, like many other kids out there, I really loved Harry Potter. Big deal, you’re probably thinking, so did half the world! I don’t think my story is unique: I grew up waving a chopstick around as my wand, yelling Expecto Patronum out the window at passing cars, dressing up with my friends as each of Voldemort’s horcruxes for midnight movie screenings, and building Quidditch hoops out of PVC pipes and hula hoops in the basement. I can’t quite explain the extent to which I loved the series without going into fan fiction though, which I both read and wrote.
It’s been a while since my fan fiction days, and it’s been a while since the last Harry Potter book came out. Since then, I think I’ve matured a bit. I went to college, I majored in engineering, someone willingly hired me, and now I code pretty often. I’ve also started reading arguably more sophisticated things, like The New York Times, where I admired the visualizations from afar. But then I thought, hey, I code too- maybe I can make visualizations like that.
Over the course of a month, I ended up learning how to build a web scraper, manipulate the data, and use d3.js to make seven graphs aimed at showing the importance of fan fiction.
To just see all of the graphs and my analysis right away, take a look at my writeup here. To see just the graphs and their code, check out my blocks. To see all of the code including scraping, manipulating, and visualizing, visit the GitHub repository. In this article, I’ll cover how I went about all of that, and I hope to leave you with:
- seven interactive graphs to play with
- a walkthrough of data gathering, cleaning, manipulating, and visualizing
- some tips for learning d3
- some nostalgia for Harry Potter
- an appreciation of fan fiction
- a warm, magical feeling
Even though my primary goal was to learn d3, I needed to start with some data (d3 stands for Data Driven Documents, after all!). The site that I frequented as a teenager was FanFiction.net, so that seemed like a natural place to start. I had never built a web scraper before, but I was able to get one up and running pretty quickly just by following some BeautifulSoup tutorials. I wrote this in Python, a language in which I had a beginner-to-moderate background. All in all, it took me about a week to fully scrape the metadata of 560,000 Harry Potter fan fictions and store it in one 184MB JSON file. That week was approximately broken down like this:
- Weekend spent building and testing the scraper on one or two pages of FanFiction.net
- Two full days of letting the scraper run, then coming back from work and finding that there were some anomalous cases where the HTML on FanFiction.net wasn’t formatted the way my scraper expected. FanFiction.net’s terms of service ask that you do not load its servers more heavily than a human could, so I had the scraper load one index page (25 stories) every five seconds before moving on to the next page.
- One full day of letting the scraper run, this time backwards. This way I didn’t have to check for when the last page was reached (the last page would now always be page 1), and I wouldn’t miss stories published while my scraper was running. I came back from work to find that one story did not have an author ID, so I added exception handling for that case, then ran the scraper again, lowering the sleep time to 2 seconds.
- Two full days of letting the scraper run, resulting in the desired JSON file!
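The crawl loop described in the list above might look roughly like the sketch below. This is not the scraper from the repository: `fetch_page` and `parse_story` are hypothetical callables standing in for the HTTP request and the BeautifulSoup parsing, the `author_id` field name is an assumption, and skipping a malformed blurb is just one possible way to handle the missing-author case.

```python
import time


def crawl_backwards(last_page, fetch_page, parse_story,
                    delay_seconds=2.0, sleep=time.sleep):
    """Walk index pages from last_page down to page 1, pausing between pages.

    fetch_page(page_number) -> list of raw story blurbs (hypothetical helper)
    parse_story(blurb) -> dict of metadata; may raise on unexpected HTML
    """
    stories = []
    skipped = 0
    # Going backwards means page 1 is always the final page to visit.
    for page in range(last_page, 0, -1):
        for blurb in fetch_page(page):
            try:
                stories.append(parse_story(blurb))
            except (KeyError, AttributeError):
                # e.g. a story with no author ID: count it and keep going
                skipped += 1
        sleep(delay_seconds)  # pause so the load stays modest
    return stories, skipped


# Tiny offline demo with made-up data instead of real HTTP requests.
fake_site = {
    1: [{"title": "A", "author_id": 7}],
    2: [{"title": "B"}],  # missing author ID
}


def parse_demo(blurb):
    return {"title": blurb["title"], "author_id": blurb["author_id"]}


demo_stories, demo_skipped = crawl_backwards(
    2, fake_site.get, parse_demo, sleep=lambda seconds: None
)
print(demo_stories, demo_skipped)
```

Passing `sleep` in as a parameter keeps the loop easy to test without actually waiting between pages.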
That JSON file contained metadata about all of the stories. This is what an index page of Harry Potter fan fiction looks like on FanFiction.net:
I saved off all of the data available in these blurbs. Ultimately, the data I had was:
- Rating (K, T, M, etc. for how mature of an audience the fic is meant for)
- Number of words
- Number of reviews
- Number of favorites
- Number of follows
- Last updated timestamp
- Publication timestamp
- Character tags
What surprised and excited me most was how active the Harry Potter fan fiction community still is. Even as I was testing the scraper, I found that the data was changing- people were updating or publishing new stories every hour or so!
Cleaning and Manipulating
- CSV file of number of fan fictions published about a given character on a given day
- Counts of how many times characters were tagged in fan fictions
- Counts of words used in titles
- Co-occurrence matrix for how often two characters co-occurred
- JSON of links and nodes between popular characters
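As a rough illustration of how the character counts and the co-occurrence matrix in the list above could be derived from the scraped metadata, here is a minimal sketch. It assumes each story is a dict with a `characters` list (a hypothetical field name), and it is not the code from the repository.

```python
from collections import Counter
from itertools import combinations


def count_tags_and_pairs(stories):
    """Return (tag_counts, pair_counts) from each story's 'characters' list."""
    tag_counts = Counter()
    pair_counts = Counter()
    for story in stories:
        # De-duplicate and sort so (A, B) and (B, A) share one key.
        characters = sorted(set(story.get("characters", [])))
        tag_counts.update(characters)
        pair_counts.update(combinations(characters, 2))
    return tag_counts, pair_counts


def to_matrix(names, pair_counts):
    """Build a symmetric matrix (list of lists) in the order given by names."""
    index = {name: i for i, name in enumerate(names)}
    matrix = [[0] * len(names) for _ in names]
    for (a, b), n in pair_counts.items():
        if a in index and b in index:
            matrix[index[a]][index[b]] = n
            matrix[index[b]][index[a]] = n
    return matrix


# Made-up sample data for illustration only.
sample = [
    {"characters": ["Harry P.", "Hermione G."]},
    {"characters": ["Hermione G.", "Ron W.", "Harry P."]},
    {"characters": ["Draco M."]},
]
tags, pairs = count_tags_and_pairs(sample)
print(tags.most_common())
print(to_matrix(["Harry P.", "Hermione G.", "Ron W."], pairs))
```

Sorting each story's tags before pairing them means a pair is counted under a single key no matter what order the tags appeared in.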
Finally, visualizing! I looked up ways to learn d3 and ultimately what I found worked best for me was:
- Read Scott Murray’s d3 tutorial- that guy is hilarious! Everything was very clearly explained. I had a good amount of HTML/CSS/JS background, but I think it’s pretty good even for those with minimal background. Also make sure to read his talk on transitions.
- Try making some bar graphs or scatterplots of data you’re really interested in. Scott covers bar graphs and it’s very rewarding to see graphs of data you care about.
- Make some silly transitions. There’s no real reason why, in the first graph I made, you can change the colors to match the different houses (red/gold for Gryffindor, blue/bronze for Ravenclaw), except that I wanted to try it. Experiments like this will get you familiar with the enter/update/exit paradigm.
- Read and very slowly walk through some blocks, especially those by the creator of d3 himself.
The first visualization I made was a scatterplot of the number of fan fictions published about a character on a given day. The first fan fiction that is still on the site was published in 1999. I wanted to see what events made people write fan fiction, and to do that I needed to collect significant events in Harry Potter history, such as when each book was published, when each movie was released, when JK Rowling made announcements, etc. I started off trying to load all of the publication dates for every character, but this was still too large a file and made the user wait a while.
I ended up going from publication day to publication week, then filtering down to the characters I thought would be interesting. I also moved the lightning bolt event markers, which were originally positioned at the number of fics written on that day, higher up so they would be more visible. Originally I drew an SVG lightning bolt shape myself. Then I decided to mark the book publication dates with a book icon and the movie release dates with a movie reel, both from the FontAwesome library, which also had a lightning bolt shape!
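Going from publication day to publication week and keeping only a handful of characters could be sketched as below. The `published` field (assumed here to be a Unix timestamp in seconds) and the `characters` field are hypothetical names, and bucketing by the Monday of each week is one choice among several.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def week_start(timestamp):
    """Map a Unix timestamp (seconds, UTC) to the Monday of its week."""
    day = datetime.fromtimestamp(timestamp, tz=timezone.utc).date()
    return day - timedelta(days=day.weekday())


def weekly_counts(stories, keep_characters):
    """Count stories per (character, week) for the selected characters only."""
    counts = Counter()
    for story in stories:
        week = week_start(story["published"]).isoformat()
        for character in story.get("characters", []):
            if character in keep_characters:
                counts[(character, week)] += 1
    return counts


# Made-up sample: two stories in the same week, one a week later.
base = datetime(2007, 7, 21, tzinfo=timezone.utc).timestamp()
sample = [
    {"published": base, "characters": ["Snape", "Harry"]},
    {"published": base + 86400, "characters": ["Snape"]},
    {"published": base + 7 * 86400, "characters": ["Snape", "Dobby"]},
]
for key, n in sorted(weekly_counts(sample, {"Snape", "Harry"}).items()):
    print(key, n)
```

Weekly buckets cut the number of rows per character by up to a factor of seven, which is the kind of reduction that helps when the daily file is too slow to load in the browser.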
This graph in the end looked like this:
This graph allows the user to hover over events and see if they correspond to spikes. Most of these spikes turned out to be rather predictable, with spikes occurring after each movie/book release date. If we switch to different characters, we often see different spikes- for instance, Sirius has one after the 5th book, and Snape, curiously enough, has one after the actor who played him in the movies, Alan Rickman, passed away.
We begin to see now a blurred line between fiction and reality. It is understandable that Sirius fan fiction spiked after Sirius the character died in the books, but for Snape fan fiction to rise after the real-life actor passed away is something different. Perhaps it shows the power of Alan Rickman’s acting to thoroughly blur the lines between himself and the character he played. Or maybe it shows the potential for healing in fan fiction: the ability to turn to a fandom that is also grieving and turn that grief into art.
I like to think that my subsequent graphs were a bit neater. Here’s one of the gender distribution in the top 50 most mentioned characters in the original seven Harry Potter books:
There’s a transition with object constancy that lets you see the characters move either up or down in popularity when we look at how often they are tagged in fan fiction instead. Here’s the end product of that transition:
Hovering over a bar lets us see how that character changed from the last graph to this one. In this case, Lily Potter went from not being among the 50 most mentioned characters to being the 7th most written about character in HP fan fiction. We also see an obvious increase in the number of female characters (denoted by gold bars) when we go from canon to fan fiction. It is likely that most people who read/write fan fiction identify as female, so maybe this is a case of just writing what you know, but maybe we are also seeing fan fiction fill a gap in canon. Harry Potter, despite some strong female characters, is primarily a male-driven story. We see here that fan fiction has the power to showcase voices that aren’t heard as much in the mainstream.
Which leads me to my next graph about slash fiction. Slash fiction, or fiction that pairs together two characters of the same gender, has a bit of a reputation in the fan fiction community. Gay characters, like female characters, are underrepresented in mainstream fiction but very popular in fan fiction. With only the metadata and not much natural language processing background, I decided to make a word cloud of fan fiction titles for both a straight, canon relationship and a gay, fan-favorite relationship.
In this graph, by hovering over a word, you can see three randomly chosen titles that have that word in them. If you compare these graphs, you’ll see that they look very similar. There are large words for each of the character names, as well as for their friends (‘Harry’ and ‘Potter’ for Romione, and ‘Marauders’ for Wolfstar), and for ‘Christmas’ (ABC might have something to do with that). There are differences, of course- Wolfstar fan fiction seems to often have ‘Chocolate’ and ‘Moon’, a reference to Lupin having offered Harry chocolate once and to his being a werewolf, but the largest word in both pictures is ‘Love’. After all, love is love, whether you are the brightest witch of your age or were sent to prison for thirteen years and can turn into a dog.
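The data behind a word cloud like this, word counts plus a few example titles per word, could be prepared along the following lines. This is only a sketch: the stopword list is illustrative, the tokenizer is deliberately naive, and the output shape (`word`, `count`, `examples`) is an assumption rather than the format used in the repository.

```python
import random
import re
from collections import Counter, defaultdict

# Illustrative stopword list, not the one used for the real graphs.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "for", "on", "with"}


def title_words(title):
    """Lowercase a title and split it into words, dropping stopwords."""
    return [w for w in re.findall(r"[a-z']+", title.lower()) if w not in STOPWORDS]


def build_word_cloud_data(titles, samples_per_word=3, seed=0):
    """Return a list of {word, count, examples}, most frequent word first."""
    counts = Counter()
    titles_by_word = defaultdict(list)
    for title in titles:
        for word in set(title_words(title)):  # count each title once per word
            counts[word] += 1
            titles_by_word[word].append(title)
    rng = random.Random(seed)  # seeded so the sampled examples are repeatable
    return [
        {
            "word": word,
            "count": n,
            "examples": rng.sample(titles_by_word[word], min(samples_per_word, n)),
        }
        for word, n in counts.most_common()
    ]


# Made-up titles for illustration only.
sample_titles = [
    "Love and Chocolate",
    "Moonlight Love",
    "A Love Like This",
    "Love, Actually",
    "Chocolate Moon",
]
for entry in build_word_cloud_data(sample_titles)[:3]:
    print(entry)
```

Precomputing the example titles keeps the hover interaction cheap, since the browser only has to display what is already attached to each word.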
I made this co-occurrence matrix to visualize how characters are related to each other in fan fiction. Darker cells indicate more co-occurrences. The matrix is limited in what it can show us though. Although darker tiles seem to indicate deeper relationships, a closer examination shows that the darker tiles tend to belong to more minor characters who were not developed as much in the canon and so are mostly attached to one character (e.g. Astoria Greengrass to Draco Malfoy, James and Lily, Rose Weasley and Scorpius Malfoy, if we include Cursed Child). Interestingly, this seems to also apply to Ron and Hermione, though both are considered main characters. Ron’s popularity falls in the prior bar chart as well. Perhaps the fandom sees him as a relatively peripheral character?
The last graph is an actual graph, with nodes and links and everything!
This graph shows the top 15 most popular characters and their top 5 co-occurrences. Some characters are more connected than others. For instance, Harry is very connected, whereas Charlie Weasley (CW on the left) is only in the graph because he is one of Nymphadora Tonks’ top five relationships. The graph naturally falls into cliques: the Marauder generation in the top left corner, the next generation in the bottom right corner, and Harry’s generation in the middle.
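One way to produce a nodes-and-links structure like this from tag counts and pair counts is sketched below. The `source`/`target`/`value` keys follow a shape commonly fed to d3 force layouts, but the exact JSON format used in the repository is not shown in this article, so treat the field names as assumptions.

```python
import json
from collections import Counter


def build_graph(tag_counts, pair_counts, top_n=15, links_per_node=5):
    """Keep the top_n characters plus each one's strongest co-occurrences."""
    tag_counts = Counter(tag_counts)
    top = [name for name, _ in tag_counts.most_common(top_n)]

    # Index the pair counts by character so neighbours are easy to rank.
    neighbours = {name: Counter() for name in tag_counts}
    for (a, b), n in pair_counts.items():
        neighbours.setdefault(a, Counter())[b] = n
        neighbours.setdefault(b, Counter())[a] = n

    links = {}
    for name in top:
        for other, n in neighbours[name].most_common(links_per_node):
            links[tuple(sorted((name, other)))] = n  # one entry per pair

    # A character outside the top_n still becomes a node if a link reaches it.
    node_names = sorted(set(top) | {name for pair in links for name in pair})
    return {
        "nodes": [{"id": name, "count": tag_counts[name]} for name in node_names],
        "links": [
            {"source": a, "target": b, "value": n}
            for (a, b), n in sorted(links.items())
        ],
    }


# Made-up counts: Charlie is not a top character but is linked to Tonks.
demo_tags = Counter({"Harry": 10, "Tonks": 5, "Charlie": 1})
demo_pairs = {("Harry", "Tonks"): 3, ("Charlie", "Tonks"): 1}
print(json.dumps(build_graph(demo_tags, demo_pairs, top_n=2), indent=2))
```

In the demo, Charlie ends up in the output even though he is outside the top two, which mirrors how a less popular character can appear in the graph through a popular character's top relationships.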
The grey circle near the middle is particularly interesting- after Harry and Hermione, the male and female leads of the Harry Potter series, Original Character is the most interconnected character. Fans have created their own characters to intermingle with seemingly every corner of the fandom, from the Marauder era, to next generation, to Tom Riddle Jr. By writing an original character, an author has engaged in an exercise of empathy- in the case of self insert fiction, they have wondered how they would handle situations in the Harry Potter universe, and in the case of a character separate from the author, they have wondered how this type of person would fit in. It is through original characters, as well as through breathing life into minor characters in the canon, that a reader can have a conversation with the author.
The Power of Fan Fiction
Overall, I embarked on this journey to learn how to make neat visualizations. I certainly learned something! There’s still a ton to learn about d3 as well as what makes a good visualization though. I suppose a good visualization lets you explore and discover and teaches something at the same time. I wasn’t sure what I would discover or learn from the resultant graphs, but I guess I’d summarize my findings by saying that fan fiction fills a space that is normally left empty.
What happens when an author publishes a story? The story goes out to readers who are often changed by these stories. They might reach out to the author and let them know, but other than that, the effects of a story are often lost. In Harry Potter, we grew up with a story that taught us themes of tolerance and love that followed JK Rowling’s beliefs. Unique to writing novels is the individualism of it- you don’t see blockbuster movies created by a single person and it’s not often that you see a song written, composed by, and sung all by the same artist.
While JK Rowling was able to share some themes, there was no space for discussion about how these topics were presented, whether with fans or with co-creators. So while Harry Potter has some strong female characters, some people of color, and one gay character revealed after the fact, JK Rowling is ultimately only one person, with the experiences of a straight, white woman. In the spirit of seeing an already loved story become better and more representative, fans insert their own experiences, and in that way perhaps we achieve a story better than the original, or at least one more representative of humanity. And that, to me, is pretty magical.