My last post was almost exactly four months ago! In that time I’ve…
- Wrapped up my time at Meta (and put out a research paper from my work)
- Successfully defended my dissertation 🎉
- Started a new gig as a data scientist at the New York Times
As you might imagine, I put this and all other outside projects on hold while I finished the dissertation. Now that I have free time again, I’m hoping to get back into non-academic writing, including in this newsletter.
To recap from that last post: I’m broadly interested this year in simulation, sequence encoding, and how they interact with novel attention markets. There’s been a lot of activity in all three areas! I’d like to do a deeper dive on some of this soon, but for now, let’s talk about how weird the current landscape of social media data is (and a Python package I just released).
One way to understand how attention works online is to collect a bunch of data from a social media platform, which requires access through an API or other means. But now, open APIs are disappearing, other methods aren’t keeping up with rapidly proliferating platforms, and everything has been complicated by the arrival of generative AI.
Twitter is a great example of the current state of social media APIs. For years, Twitter was the favorite platform of social media researchers—it allowed easy access to enormous samples of tweets, making it the de facto choice for published work on collective digital behavior. Now, the company is charging upwards of $40,000 a month for its API, pricing out basically everyone. Researchers are left without a way to measure collective behavior at scale, undercutting a valuable resource for computational research.
And broadening the search doesn’t turn up any great alternatives. Large scale platforms that offer easy access to well-structured data are in short supply. TikTok has a research API, but it’s only accepting applications from researchers based in the US. Data around organic user behavior on Facebook is limited to URL sharing activity, and that is similarly locked behind an application process. For anyone who can’t get access through these processes, we have entered a post-API world.
Of course, there’s an argument to be made that increased scrutiny and decreased access are a positive development. User data is no longer getting thrown around with abandon, companies are finally getting serious about privacy, and large-scale tracking of our collective preferences and behaviors is a little less viable for bad actors. Some of that may be true, but reduced access doesn’t equate to elimination. Companies still leverage our data to monitor our behaviors and train large-scale models. And while that work continues, external researchers whose approach is more critical—trying to understand the social impact of these systems, or to dissect algorithmic recommendations—may be hampered. We are left with an uneven playing field, in which many of the opaque processes governing our largest platforms are allowed to continue unabated.
In such an environment, there are still ways to get access to social media datasets. Some websites make calls to internal APIs when they surface data, and those endpoints might accept requests from anywhere. This is the approach my recent Python package takes to allow data collection from Substack—hitting internal endpoints with requests for newsletter data returns JSON objects with all sorts of useful information. It takes some reverse engineering to figure out how to query internal endpoints, what kinds of data they return, and what limitations they include, but it can be a powerful way to programmatically build datasets. In other cases, there are great external resources with high-quality data ingestion. Jason Baumgartner’s Pushshift, which collects data from Reddit (and other platforms), is a prime example. And finally, if all else fails, web scraping methods of varying sophistication can work on many websites.
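To make the internal-endpoint approach concrete, here’s a minimal sketch in Python. The archive path and query parameters are assumptions inferred from the kind of traffic you might observe in a browser’s network tab; Substack doesn’t document these endpoints, and they can change without warning.

```python
import requests

# Illustrative archive endpoint for a hypothetical publication. The
# /api/v1/archive path and its parameters are assumptions inferred from
# browser traffic, not a documented API.
BASE_URL = "https://example.substack.com/api/v1/archive"

def fetch_archive_page(offset=0, limit=12):
    """Fetch one page of post metadata, as the web archive view does."""
    resp = requests.get(
        BASE_URL,
        params={"sort": "new", "offset": offset, "limit": limit},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()  # typically a list of post objects: title, slug, dates

for post in fetch_archive_page():
    print(post.get("post_date"), post.get("title"))
```

The reverse engineering mostly happens in your browser’s developer tools: watch the network tab while a page loads and note which requests return the JSON that ends up rendered on screen.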
The problem with these approaches is the intensive work they require, compounded by the range of interesting platforms currently proliferating across the web. Monolithic platforms are still dominant, but think of the range of places someone might spend their time online now: watching YouTube, browsing through channels in myriad Discord servers, messaging friends in group chats on WhatsApp, tuning into livestreams on Twitch. If you wanted to fully understand the dynamics of online behavior, even at the coarsest level, you would have to build and maintain data collection pipelines across so many of these platforms. And each of them comes with unique challenges, either because of the platform’s architecture (how do you build a representative sample of Discord activity?) or because of the medium (how do you store and extract information from massive amounts of short form video?). It’s a seemingly intractable tension—the tools we use to understand the web are all geared toward narrow contexts, while people increasingly cross those contexts on a day-to-day basis.
In the past, I would have argued for a relatively straightforward (but still challenging!) approach to this obstacle. Build open-source tools geared toward popular platforms. Make each of them a composable part of a broader framework of data collection, enabling flexible approaches to building high-quality samples. Place this project within a larger funded research initiative, to ensure ongoing maintenance and access. As I mentioned above, studying these environments is asymmetrical from the jump, and open source software is one way to help level the playing field. But the calculation has changed significantly with the introduction of generative AI.
Large-scale models require lots of high-quality training data. Some training data is freely available on the web, some is available on the web but restricted by licensing or copyright, and some is in a gray area—available, but probably not earmarked by its creator for use in a training set. Model creators exercise varying amounts of caution around these distinctions: see concerns around GitHub Copilot reproducing copyrighted code. But regardless of the ethical position of any individual model creator, there is always a risk that broadening access to text, images, and other kinds of valuable data opens a channel for unauthorized use of somebody’s work output.
The version of the Substack API package that I published is drastically reduced from what I used in my research—I didn’t feel comfortable making the capability to download every public Substack post in one go widely available. Writers use Substack as the medium for their copyrighted work, and as a convenient channel to broadcast that work to opt-in audiences. They (likely) don’t use it to add their intellectual output to a giant slush pile of training text, increasing the possibility that companies use an AI to replace them down the road. Given the pace of development, even the most remote possibility of furthering that second outcome felt like a tangible enough concern to factor into the package’s design. It’s certainly on the minds of platform owners—Reddit and Stack Overflow are both exploring ways to monetize their content as training data. I imagine similar scenarios will unfold elsewhere in open source software for the foreseeable future.
Facing a general drop in access to large datasets, the best path forward for social media researchers might be creative use of less data. Data donations are exports of user data—retrieved via GDPR-mandated functionality or other means—provided by users themselves. While smaller in scale, these exports provide deep information on individual behavior, and are collected with the full consent of the user. Similarly, some studies ask participants to install a browser extension, which collects web usage data for researchers to analyze. And while these datasets might be orders of magnitude smaller than what could historically be had from a platform’s API, they could also provide a seed set for data augmentation. One could imagine a world in which large-scale behavioral datasets are synthesized along important characteristics, by processes benchmarked against their real-world counterparts.
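As a toy sketch of that last idea: fit a simple generative model to a small donated seed of daily posting counts, synthesize a much larger sample, and benchmark the result against the seed. Everything here is hypothetical, and a real system would need a far richer model than a single Poisson rate.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical seed: daily post counts from a handful of data donations.
seed_counts = np.array([0, 1, 1, 2, 3, 0, 5, 2, 1, 4])

# Fit the simplest possible generative model (a single Poisson rate)
# and synthesize a much larger behavioral sample from it.
rate = seed_counts.mean()
synthetic = rng.poisson(lam=rate, size=100_000)

# Benchmark the synthetic data against its real-world counterpart; a
# mismatch (e.g., in variance) signals the model needs more structure.
print(f"seed mean: {seed_counts.mean():.2f}  synthetic mean: {synthetic.mean():.2f}")
print(f"seed var:  {seed_counts.var():.2f}  synthetic var:  {synthetic.var():.2f}")
```

Even this crude version shows how the benchmarking step earns its keep: the seed above is more variable than a Poisson process allows, so the variance check immediately flags where the synthetic data diverges from real behavior.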
Regardless of the path forward, it seems increasingly clear that changes in the platform landscape are pushing large-scale, computational research into unfamiliar territory. How this work fits into the fight between data-hungry LLMs and copyright-sensitive platforms remains to be seen. But reducing our reliance on platform-provided research data, finding new ways to construct realistic simulacra of collective behavior, and continually examining new types of online interaction can help chart a course for the future of computational social media research.