FinOps War Stories

In the third week of March 2024, at table 5, we shared some FinOps war stories.
Comments, support, and your own war stories are very welcome.

Here they are:

From @stjn

Thx Frank! I love to hear the good, the bad and the ugly. Every customer, every FinOps practice has their moments of glory and moments of defeat. Let’s hear how you handled them.
A story from last week… for me, it was the first time I could show the power of the actions we took. The backdrop: a multimillion-dollar cloud-spending organisation where I had recently started building a FinOps practice.
This month our biggest sponsor (money-wise) came to me, a bit dissatisfied. The message she brought was that the org had overrun its budget, which was an attempt to put us in a defensive position. However, the answer we could give was: it could have been worse. We had made significant savings, and we could share the numbers. Instead of being pushed onto the defensive, giving that answer was rewarding, and I immediately felt the atmosphere turn. Instead of being messengers of bad news, we got a smile and a joke about me always seeing the positive side of things. The rest of our meeting was super constructive, and the buy-in was bigger after the meeting than before it.
Winning together is so important and that was exactly what we did…
Now, who is next?


@frank

First, it is interesting to notice that I have many war stories from my SysAdmin days and fewer from my FinOps days.

For me, it is the CUR files, from a time before the CUDOS dashboard. My focus at work is on AWS pricing, AWS commitments and reporting back to customers, and it was hard. Yes, there are native tools to look at cost, but they are limited, and you often need to go through Excel, which is a source of pain and errors.

So I started building dashboards in QuickSight, assuming the CUR had everything I needed ready to use. I was surprised how painful extracting the data was. Even the simplest value, like the on-demand usage equivalent, was not there and required a massive query to get (and still does). Almost any KPI of interest was not easily accessible, required a lot of elaboration and complicated queries, and was (and is) hard to debug. I was happy when the CUR library came out, mostly because it showed I was not alone in feeling the pain.

Now I know more about the CUR and AWS pricing than I think I should, but as a data boy, I like it. I would not want to start all over again, though. Some of my work on pricing is here:
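To give a flavour of why even the "simple" on-demand equivalent takes work: below is a minimal pandas sketch, assuming a CUR extract with Athena-style column names. The line-item types and numbers are toy data; a real query has to handle many more line-item types and edge cases (fees, refunds, RI/SP recurring charges, and so on).

```python
import pandas as pd

# Toy CUR extract. In a real CUR there are many line_item_line_item_type
# values, and pricing_public_on_demand_cost is only meaningful for some.
cur = pd.DataFrame({
    "line_item_line_item_type": [
        "Usage", "DiscountedUsage", "SavingsPlanCoveredUsage", "Tax",
    ],
    "line_item_unblended_cost": [10.0, 0.0, 0.0, 1.5],
    "pricing_public_on_demand_cost": [10.0, 8.0, 12.0, 0.0],
})

# On-demand equivalent: what the same usage would have cost at public
# on-demand rates, so only genuine usage lines count.
usage_types = {"Usage", "DiscountedUsage", "SavingsPlanCoveredUsage"}
usage = cur[cur["line_item_line_item_type"].isin(usage_types)]
on_demand_equivalent = usage["pricing_public_on_demand_cost"].sum()
print(f"On-demand equivalent: ${on_demand_equivalent:.2f}")
```

Even this sketch glosses over the hard parts (amortisation of upfront fees, credits, marketplace lines), which is exactly why the real queries get massive.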

@ben.demora

So…
Let’s say you’re a streaming platform. And as part of what you offer, you have an app that runs on end clients like games consoles and phones.
When that app is loaded but NOT currently streaming shows, you’d like to display a carousel of posters advertising all the wonderful things your customers can watch. Those posters will cycle through a small set of images, roughly every 8 to 10 seconds.
The carousel loads images that are periodically refreshed into the application’s on-device cache storage space.
All good?
That is, until someone makes an image that is too big to fit into the application cache storage space.
And that image is served out by Akamai CDN.
So instead of the image getting pushed to devices once, it gets pushed every time the carousel flips around. On every single device with the app installed. Across the whole of Europe.
That’s how you burn around $740k in about 3 days, by accident.
Best bit? Akamai only does monthly billing runs, so the only way you catch this is through a spike in activity.
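A back-of-the-envelope of how that burn multiplies, with every number invented for illustration (device count, image size, carousel timing, CDN rate are all hypothetical): the point is the extra factor of fetches-per-device once the cache stops absorbing requests.

```python
# All numbers below are hypothetical, purely to show the multiplication.
devices = 1_000_000       # devices with the app open, on average
image_mb = 20             # the oversized poster image, in MB
fetches_per_hour = 50     # that poster comes around roughly every 72 s
hours = 72                # the three days before anyone noticed
usd_per_gb = 0.01         # assumed CDN delivery rate

# Cached correctly, each device fetches the image once:
cached_gb = devices * image_mb / 1024
# Too big for the cache, it is re-fetched on every carousel pass:
uncached_gb = cached_gb * fetches_per_hour * hours

print(f"one fetch per device: ${cached_gb * usd_per_gb:,.0f}")
print(f"fetch on every pass:  ${uncached_gb * usd_per_gb:,.0f}")
```

With these made-up inputs the bill goes from a few hundred dollars to several hundred thousand, the same order of magnitude as the story, purely from the `fetches_per_hour * hours` factor.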

@JezBack

So here is a good lesson for a fledgling FinOps Practitioner… Back in my Salad Days…
I was relatively new to Cloud Economics and the story of FinOps. I was getting super excited about seeing how much money I could save a client on a cost out exercise.
I added up all the savings from the rightsizing, the terminations and all the utilisation opportunities…then I added all the Committed Usage Discounts…and got so excited…
Then presented it to the client…who then told the people upstairs - and they got all excited.
Then I realised… you can’t just add all the Rate Discounts - you need to factor in the usage optimisations first - you idiot - yeah… left a $1.5m shortfall in the promised savings to the people upstairs… What a throbber I was.
Silver Lining: I did manage to find more than the promised amount a little while later, but still…
The LESSON: Always account for the shifts you plan to make in usage cost efficiency BEFORE you plan your committed savings!
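The arithmetic behind that shortfall, with purely illustrative numbers: commitment discounts apply to the bill left *after* usage optimisation, so the two savings figures can’t simply be added.

```python
# Hypothetical monthly bill and savings rates, to show why the two
# savings numbers don't stack additively.
bill = 20_000_000            # current on-demand spend
usage_savings_rate = 0.30    # rightsizing, terminations, utilisation work
commit_discount_rate = 0.25  # committed-use discount on what remains

# Naive (wrong): apply both rates to the full bill and add.
naive_savings = bill * usage_savings_rate + bill * commit_discount_rate

# Correct: optimise usage first, then commit against the smaller baseline.
optimised_bill = bill * (1 - usage_savings_rate)
actual_savings = bill * usage_savings_rate + optimised_bill * commit_discount_rate

print(f"naive:     ${naive_savings:,.0f}")
print(f"actual:    ${actual_savings:,.0f}")
print(f"shortfall: ${naive_savings - actual_savings:,.0f}")
```

With these invented inputs the gap is $1.5m - the same order as the story - because the commitment discount was promised against usage that the optimisation work then removed.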

I love this. A question that must crop up in every organisation: “Is FinOps worth what we’re spending on it?”, “Can we reduce the FinOps budget?”. The beautiful response is to show what would have happened without FinOps, and point out that there are more savings to come!


This seems like a good chat in Slack and a topic here :slight_smile:

I might have a kind of war story here, even if a little one.

We’re now strongly advising business units to turn off non-prod databases on nights and weekends, so we wanted to implement that for an existing project that was using a costly DB. It worked, and we saved some money.

However, a little bit later, we discovered very high AWS Config costs on that account (the account’s bill almost doubled), and the weird thing was that the costs landed mostly on nights and weekends, with almost nothing during the day.

We began to investigate: no infra changes at the start of the higher costs, no relevant code changes either. Then the dev told us: oh, but, you know, when the DB is off, the container running the app is constantly trying and restarting, over and over. For some reason that generates a lot of infra changes, which are recorded by AWS Config, hence the high costs.

And since our compute automation wasn’t ready yet, we decided to turn the database back on 24/7 to save money until the automation was ready: running the DB all the time cost 30x less than the AWS Config charges.

Conclusion:

  • plan for stopping non-prod resources on nights/weekends from the start
  • architect your apps so that they react to issues in a controlled manner
  • be ready for counter-intuitive solutions
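The second point is exactly the restart storm in the story. A minimal sketch of one controlled-reaction pattern, exponential backoff with a cap; the function names and parameters here are illustrative, not from the original setup.

```python
import time

def connect_with_backoff(connect, base=1.0, cap=300.0, max_attempts=10):
    """Retry `connect` with exponential backoff instead of hammering a
    database that may simply be off for the night."""
    delay = base
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                raise
            time.sleep(delay)
            delay = min(delay * 2, cap)  # 1s, 2s, 4s, ... capped at 5 min

# Illustrative use: a flaky connect that succeeds on the third try.
attempts = {"n": 0}
def fake_connect():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("db is down")
    return "connection"

print(connect_with_backoff(fake_connect, base=0.01))
```

In production you would usually add jitter to the delay and give up (or page someone) after the cap is reached, rather than generating an infra change per retry for AWS Config to record.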

At a former gig there were over 3000 development databases running below 1% average max utilization.
With the developers spread out over the globe and no fixed office hours, we opted for consolidation instead of smart scheduling. For each type of database needed (Oracle, MySQL, MS SQL, …) we created one larger instance and provided a self service portal where the developer got their own db schema on the shared instance rather than a whole instance to themselves. The cost reduction was close to 100%.
Bolstered by this success, my manager decided to do the same in production. We were tasked to implement rigorous checks (e.g., you can’t have file access or un-optimized long-running queries on shared instances) because we were obviously worried about the blast radius if something went wrong. My manager (“you don’t learn without pain”) interpreted such warnings as cowardice and we were encouraged to keep such “mistrust” to ourselves.
It was (partially) a success: we actually managed to identify all the “noisy neighbours”, and things ran smoothly from day one. We consolidated 900 production databases onto one big instance, with a cost reduction of over 65%. What we did not account for was that our instance cluster - I distinctly remember the utilization was around 53%, so far from maxed out - simply stopped running. No warning, no notification, not even a log entry. We only started looking into things when we got a wave of disgruntled customers calling support. Even the CSP couldn’t explain it. They gave plenty of credits, but we lost a big customer, which swamped any achieved savings.


The lesson I take from it is: while CSPs have probably tamed virtualisation scalability, the software you run on top might not have.

What is your take on it?

Lesson: keep track of your costs daily, especially after big changes.

Scalability is easy if vertical, and very challenging when horizontal.

I can’t remember the name, but I heard of a company that got into trouble because Azure didn’t deliver the promised next-bigger instance size, and they could not scale vertically any further.

Similarly to the classic monolith vs micro-services question, you won’t build a horizontally scalable application from scratch unless you already know you’re going to need it. However, you should keep it in mind so you can later evolve the architecture without too many hurdles.

1 Like