Creating a Filestore persistent volume claim
The results were impressive. Even with Shane intentionally pushing the limits of the test instance, other users on the hub experienced no noticeable impact. They were able to open and run notebooks, download datasets, and generally work without interruption. It was a real testament to Filestore’s ability to isolate workloads and ensure consistent performance, even under demanding conditions.
This gave UC Berkeley the confidence to move forward with a larger-scale deployment. They were particularly impressed by Filestore’s read/write and latency performance, which met or exceeded their expectations. The Basic tier doesn’t support scaling storage capacity down; the Regional and Zonal tiers do, but they weren’t the right fit for their needs due to cost and storage capacity considerations.
After deciding on Filestore Basic, Shane and his team had to act fast. They had a hard deadline: the start of the Spring 2023 semester was just a few short weeks away. This meant a complete redeployment of their JupyterHub environment on GKE, with Filestore as the foundation for their storage needs. Careful planning and efficient execution were critical.
Shane and his team had some important decisions to make. First up: how to structure their Filestore deployment. Should they create a shared instance for multiple hubs, or give each hub its own dedicated instance?
Given the scale of Datahub and the critical importance of uptime, they decided to err on the side of caution – a decision undoubtedly influenced by their past experiences with storage-related outages. They opted for a one-to-one ratio of Filestore instances to JupyterHub deployments, effectively over-provisioning to maximize performance and reliability. They knew this would come at a higher cost, but they planned to closely monitor storage usage and consolidate low-usage hubs onto shared instances after the Spring 2023 semester.
The next challenge was to determine the appropriate size for each Filestore instance. Without historical data to guide them, they had to make some educated guesses. Since Datahub is designed for flexibility, they couldn’t easily enforce user storage quotas – a common challenge with JupyterHub deployments.
They turned to what data they did have, reviewing usage patterns from previous semesters where user data was archived to Cloud Storage. After some back-of-the-napkin calculations, they settled on a range of instance sizes from 1TB to 12TB, again leaning towards over-provisioning to accommodate potential growth.
Once the fall semester ended and they’d archived user data, the real work began. They created the Filestore instances, applied the necessary configurations (including NFS export and ROOT_SQUASH options), and even added GKE labels to track costs effectively — gotta love a bit of cost optimization!
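For readers curious what that configuration step can look like, here is a minimal sketch of a flags file that could be passed to gcloud filestore instances create with --flags-file to set NFS export options such as ROOT_SQUASH. The instance name, zone, capacity, labels, and IP range below are illustrative placeholders, not Datahub’s actual values:

```yaml
# nfs-flags.yaml -- hypothetical flags file, used with something like:
#   gcloud filestore instances create datahub-spring2023 \
#     --zone=us-central1-b --tier=BASIC_HDD --network=name=default \
#     --labels=hub=datahub,env=prod --flags-file=nfs-flags.yaml
--file-share:
  name: shares              # the NFS share exported by the instance
  capacity: 1024            # in GB; Basic tier instances start at 1 TB
  nfs-export-options:
    - access-mode: READ_WRITE
      ip-ranges:
        - 10.0.0.0/8        # limit mounts to the cluster's VPC range
      squash-mode: ROOT_SQUASH
      anon_uid: 1000        # UID/GID that root on clients is mapped to
      anon_gid: 1000
```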
With the data in place, it was time for the final switchover. They updated their JupyterHub configurations to point to the new Filestore instances, deleted the remnants of their old NFS setup, and with a mix of anticipation and relief, relaunched Datahub.
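As a concrete illustration of that switchover, here is a simplified sketch of how a JupyterHub deployment on GKE can point at a Filestore instance: a PersistentVolume referencing the instance’s NFS share, and a PersistentVolumeClaim that the hub’s user pods mount. The names, IP address, and sizes below are assumptions for illustration, not Datahub’s actual configuration.

```yaml
# Illustrative only: a PersistentVolume backed by a Filestore NFS share,
# plus the PersistentVolumeClaim a JupyterHub deployment binds to it.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: datahub-filestore          # hypothetical name
spec:
  capacity:
    storage: 1Ti                   # match the Filestore instance size
  accessModes:
    - ReadWriteMany                # many user pods mount the same share
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 10.0.0.2               # the Filestore instance's IP address
    path: /shares                  # the instance's file share name
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: datahub-home
  namespace: datahub
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""             # bind to the pre-created PV, not a dynamic class
  volumeName: datahub-filestore
  resources:
    requests:
      storage: 1Ti
```

In a zero-to-jupyterhub-on-Kubernetes setup, the hub’s Helm values would then reference that claim (for example, singleuser.storage.type set to static, with static.pvcName pointing at the PVC), so every user’s home directory lives on the shared Filestore volume.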
Managing Filestore
Since migrating to Filestore, Shane and his team at UC Berkeley have enjoyed a level of stability and performance they hadn’t thought possible. In their words, Filestore has become a “deploy-and-forget” service for them. They haven’t experienced a single minute of downtime, and their users — those thousands of students depending on Datahub — haven’t reported any performance issues.
At the same time, their management overhead has been dramatically reduced. They’ve set up a few simple Google Cloud alerts that integrate with their existing PagerDuty system, notifying them if any Filestore instance reaches 90% capacity. In practice, these alerts fire rarely, and scaling up storage when they do is straightforward.
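To give a sense of what such an alert can look like, here is a rough sketch of a Cloud Monitoring alerting policy that fires when a Filestore instance passes 90% capacity, in the YAML form accepted by gcloud alpha monitoring policies create with --policy-from-file. The metric name, threshold, and notification channel are assumptions for illustration, not UC Berkeley’s actual policy.

```yaml
# Hypothetical alerting policy; adjust the metric, duration, and channels as needed.
displayName: "Filestore instance above 90% capacity"
combiner: OR
conditions:
  - displayName: "Used bytes percent > 90"
    conditionThreshold:
      filter: >-
        resource.type = "filestore_instance" AND
        metric.type = "file.googleapis.com/nfs/server/used_bytes_percent"
      comparison: COMPARISON_GT
      thresholdValue: 90
      duration: 300s
      aggregations:
        - alignmentPeriod: 300s
          perSeriesAligner: ALIGN_MEAN
notificationChannels:
  - projects/my-project/notificationChannels/1234567890   # e.g., a PagerDuty channel
```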
To further optimize their usage and control costs, they’ve implemented a simple but effective strategy. At the end of each semester, they archive user data to Cloud Storage and then right-size their Filestore instances based on usage patterns. They either create smaller instances or consolidate hubs onto shared instances, ensuring they only pay for the storage they need. Rsync remains their trusty sidekick for migrating data between instances — a process that, while time-consuming, has become a routine part of their workflow.
The good, the challenging, and the (occasionally) unpredictable – Shane’s version
When reflecting on UC Berkeley’s Filestore journey, Shane didn’t sugarcoat things. They learned a lot, and not everything has been easy. In the spirit of transparency, here’s a breakdown of the experience, in Shane’s own words, into the good, the challenging, and the (occasionally) unpredictable.
The good
Nothing beats peace of mind – especially in the middle of a semester. Moving to Filestore has been a game changer, allowing the team to trade midnight debugging sessions for restful nights of sleep. No more frantic calls about crashed servers or rescheduled exams — Filestore’s uptime has been rock-solid, and its performance at the Basic tier has been more than enough to keep pace with our users.
And as we dug deeper into Filestore, we discovered even more ways to optimize our setup and improve operations at UC Berkeley:
- Sharing is caring (and cost-effective!): We found opportunities to consolidate hubs with smaller storage requirements onto shared instances, for greater cost savings.
- Right-sizing is key: We’ve become pros at resizing aggressively and adding storage only when needed.
- Exploring the Filestore Multishare CSI driver: We’re actively evaluating the Filestore Multishare capability to streamline scaling storage capacity up and down, and to understand any potential cost deltas. It could save us further time and effort compared to our current Filestore deployment, but we can’t adopt it yet because we’re on the Basic HDD tier (see the sketch after this list).
- Empowering our faculty: We’re working closely with faculty and instructors to help them educate students about data management best practices, and giving them friendly reminders that downloading only the megabytes of data they actually need (as opposed to terabytes) can make a real difference.
- Smarter archiving: We’re continually analyzing our storage metrics and usage behavior to optimize our archiving processes. The goal is to archive only what’s necessary, when it’s necessary.
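As a sketch of what that Multishare exploration might look like on GKE, the Filestore CSI driver can dynamically provision shares from a shared Filestore Enterprise instance through a multishare StorageClass. The example below assumes the GKE-provided enterprise-multishare-rwx class and an illustrative claim size; it is not something Datahub runs today, since Multishare requires the Enterprise tier rather than Basic HDD.

```yaml
# Hypothetical: dynamically provision a share from a multishare-enabled
# Filestore Enterprise instance via the GKE Filestore CSI driver.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: small-hub-home
  namespace: small-hub
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: enterprise-multishare-rwx   # GKE-provided multishare class
  resources:
    requests:
      storage: 100Gi          # shares can be much smaller than a full instance
```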
The challenging
That’s not to say there are no drawbacks. Filestore isn’t exactly a budget-friendly option. Our cloud storage costs did go up, and while we’ve managed to mitigate some of that increase through our optimization efforts, there’s no denying the price tag. However, an increase in cloud costs is well worth the collective sanity of our team!
One thing we’re still grappling with is the lack of easy down-scaling in Filestore Basic. It’s not that it’s technically difficult, but manually resizing instances does take some time and can disrupt our users, which we obviously want to avoid. At the same time, we’re getting better at forecasting our storage needs, and the tips we’ve outlined — especially around right-sizing — have made a huge difference. But having a more streamlined way to scale down on demand would be a huge win for us. It could save us thousands of dollars each month — money we could redirect towards other critical resources for our students and faculty.
The unpredictable
Data science is, by its very nature, data-intensive. One of our biggest ongoing challenges is predicting just how much storage our users will need at any given time. We have thousands of students working on a huge variety of projects, and sometimes those projects involve datasets that are, well, massive. It’s not uncommon for us to see a Filestore instance grow by terabytes in a matter of hours.
This unpredictable demand creates a constant balancing act. We want to make sure our SRE team isn’t getting bombarded with alerts, but we also don’t want to overspend on storage we might not need. It’s a delicate balance, and we often err on the side of caution — making sure our users have the resources they need, even if it means higher costs in the short term.
As of now, Filestore makes up about a third of our total cloud spend. So while we’re committed to making it work, we’re constantly looking for ways to optimize our usage and find that sweet spot between performance, reliability, and cost.
In conclusion
UC Berkeley’s journey highlights a critical lesson for anyone deploying large-scale pedagogical platforms as force multipliers for instruction: as JupyterHub deployments grow in number, complexity, and scale, so too do the demands on the supporting infrastructure. Achieving success requires finding solutions that are not just technically sound but also financially sustainable. Despite challenges like a higher price tag, a slight learning curve with Filestore Basic, and some missing automation tools, Filestore proved to be that solution for Datahub, providing a powerful combination of performance, reliability, and operational efficiency, and empowering the next generation of data scientists, statisticians, computational biologists, astronomers, and innovators.
Are you looking to improve your JupyterHub deployment? Learn more about Filestore and GKE here.