The Audit

eCommerce and Surviving the Black Friday Frenzy with Jeff White

October 02, 2023 IT Audit Labs Season 1 Episode 27

The Audit - Episode 27 - Imagine managing over a million orders per minute during a high-stakes sales event like Black Friday! That's the reality Jeff White from Cockroach faced during his time at Best Buy. We sit down with him in a lively discussion, unpacking the intricacies of running a successful online store, the immense pressure involved, and strategies to guard against malicious acts and bots. Jeff enlightens us about the challenges of scaling an Oracle database to handle a mass influx of orders, sharing insightful anecdotes from his own experience.

Ever wondered how to improve your security posture and reduce organizational risk? Jeff is here to share some answers from a tech perspective. He delves into the unique features of CockroachDB, a system he played a vital role in developing. Learn how it's designed to run on various operating systems and its resilience to node failure. Jeff also sheds light on distributed data replication, an intriguing aspect of CockroachDB. If you're a tech enthusiast or involved in e-commerce, this episode is packed with valuable nuggets of information to take your knowledge several notches higher.

As we wrap up our conversation, we navigate towards Jeff's interest in renewable energy. We delve into his journey with solar power and electric vehicles, outlining the financial benefits of such investments. He shares his solar installation experience and future plans. We also touch on the critical role of a robust team in conducting successful security assessments. Lastly, we consider a new venue for our game night, since our usual spot isn't available. Tune in for this enlightening episode full of expert insights and real-world experiences.

Speaker 1:

You are listening to The Audit, presented by IT Audit Labs. What sets us apart? We offer a full-designed framework tailored to your specific security needs that will reduce organizational risk and improve security posture. Contact IT Audit Labs and have us lead your team in outlining a strategic approach to remediate organizational risk. In today's episode we will be talking to Jeff White of Cockroach. CockroachDB is designed to survive software and hardware failures, from server restarts to data center outages. Jeff will be giving us a demo and going in-depth on CockroachDB's services. Thanks for listening. Jeff, your time at Best Buy, you were working in or on databases and then somehow stumbled across Cockroach, which, honestly, I had never heard of until Gretchen said, hey, my husband works at Cockroach. It's like a really cool company and cool things that you guys are doing. So I was like, wow, we got to explore that a little bit more.

Speaker 2:

Yeah, I had worked at Best Buy, first for Dell SecureWorks. Then I moved actually to another company at Best Buy, Verizon Enterprise Services, as a service delivery manager helping run BestBuy.com. And then that was insourced and a number of us became Best Buy employees. So for a while there I was running all the data center type operations, more or less handing off from Verizon, and then there was sort of a reorg.

Speaker 2:

I had a fair amount of database experience. I'm not a DBA, but I've managed a lot of DBAs over the years, so my, we'll call it, big database experience goes back all the way probably to like '97, when I started managing people who were managing Oracle. And then at US Postal Service I was managing USPS.com, which had of course a big database component there, which was also Oracle. Then later, Best Buy still has a fair Oracle install, and so one of the things I did at Best Buy was set up engineered for them. I didn't do necessarily all the work, but I told them what I needed, which was to build something called Oracle Maximum Availability Architecture, or MAA, which is where you have multiple RAC clusters and they're synchronizing together, and then it was synchronizing to another RAC cluster in another location and it was very, very highly available. And that was necessary because if this database wasn't working, Best Buy couldn't take orders, which creates a lot of anxiety from everyone when you can't take orders.

Speaker 1:

Which, from the consumer side, and this is outside looking in, I remember, kind of in that, you know, the teens era, Best Buy was having some issues in the holiday season keeping the website available for online orders as people were transitioning to more of that online platform versus the in-store platform. Any truth to that?

Speaker 2:

Oh, yes, yes. Pre-COVID, we'll call it pre-COVID, we, and lots of other e-commerce retailers, had unfortunately trained our customers that the sales would all start at, let's say, midnight on Thanksgiving Day. So customers would all stampede into the website at 12:01 a.m. on Thanksgiving and you would get this massive surge of activity. And so Best Buy learned, okay, first thing, don't tell them what time. That's the first thing. Do not tell them what time.

Speaker 2:

And I remember, because when I was at Best Buy, we would watch our competitors. So, for instance, I remember, a few weeks before Black Friday, Walmart announced that they were going to have all of their stuff on sale at 9 p.m. on the day before Thanksgiving. 9 p.m., you go to the website. And operations, that was us, we all looked at that and said, one of two things is going to happen: they're going down at 9:02 p.m., or they're going to stay up and we're going to have to try to make our website survive the same assault next year when our sales and marketing people decide to call the time. Well, 9:02 p.m., Walmart.com died, which was totally predictable.

Speaker 1:

And you would think, from a security perspective as well, it's like, well, okay, that's the time that we're going to DDoS them, if somebody wanted to be a malicious actor to disrupt their service.

Speaker 2:

Best Buy has a very sophisticated e-commerce pathway to get from the public to the actual website. So, for instance, everything has to flow through Akamai. If you get a bestbuy.com address you'll actually see it's Akamai. And then there was filtering and DDoS protection, et cetera, done by Akamai. Best Buy had been an Akamai customer for a long time, I'm sure they still are, so there was that level. And then the web security team, they would hold back fixes for bots and such until maybe an hour or two before they were going to Black Friday, and then they would pull the trigger, and that was to buy themselves time from the, you know, not necessarily bad actors, we'll call them the bots and such. They were the biggest threat for taking the site down.

Speaker 2:

So the scale of some of these e-commerce sites is pretty surprising. The footprint, on how big they would scale: over 10,000 virtual machines, 10 to 20,000 virtual machines, to try to sustain the website. And you had a very complex stack of interactions and there was a, you know, learn by what failed. And one of the biggest barriers we had was to scale this Oracle database every year to try to take the assault. And that's the strange part, is that you would spend six months of the year for, let's call it, 15 minutes of workload. You're knob-turning and scaling and testing to get through that 15 minutes. Our biggest challenge actually was PlayStation 5s and Xboxes. They caused about a year of heartburn, to the point where I started calling them toxic waste, like the amount of money that was spent trying to sell the darn things was certainly not equal to, in my opinion, the money that was made selling them. I can remember at one sale we were having, there was over a million attempted orders per minute.

Speaker 1:

Wow. Just keeping track of that and the cards, who has it all?

Speaker 2:

That would be pretty common. Well, these are attempted orders that we were blocking. You know, 99.99% of them were being blocked. Oh, from bots and everything. Yeah, so like you can't scale to that size. You can't. There's just nothing you can do about it. So there was actually multiple layers of defense, threat intelligence and such like that. And then it got to be a secret when we were going to put stuff up for sale, and then we would do dumb things like get predictable.

Speaker 3:

I have to admit I was responsible for at least one of those million attempted PS5 orders, and I would sign up for the Twitter alerts. People have their Twitter alerts.

Speaker 2:

I think it was. Was it Wario?

Speaker 3:

Yeah, yeah, there were a lot of different Twitter and then you could get the alerts when they would alert you when the sales went live on Best Buy or Walmart or whatever.

Speaker 2:

It was always funny. They would enable the SKU to sell. Because my team was running the database, we would know the inventory, that we're going to sell this many thousand, and, you know, we would enable the SKU. And then it was always a joke to see how long will it take the Internet to discover we have them for sale. And that was generally about two minutes. Two minutes and the flow would start, and then we would get smart, like if you hit reload too often, behaving like a bot, we'd start putting you into a penalty box. Like, one strategy was, early on, they discovered that the scalpers were actually drop-shipping a scalped item, basically from us, so they would buy it from us and they would have it shipped from us to someone they sold it to on eBay for, of course, far more. So then of course they changed it, so you had to go to the store to pick it up.
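
[Editor's sketch: the "penalty box" Jeff describes can be illustrated with a small sliding-window limiter. This is only a toy model under stated assumptions, not Best Buy's actual bot defense; the class name, thresholds, and client identifiers are all hypothetical.]

```python
from collections import defaultdict, deque


class PenaltyBox:
    """Toy sliding-window limiter: too many requests in a window earns a timeout."""

    def __init__(self, max_requests=5, window_seconds=10.0, penalty_seconds=60.0):
        # All three thresholds are made-up illustration values.
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.penalty_seconds = penalty_seconds
        self._recent = defaultdict(deque)  # client id -> timestamps of recent requests
        self._blocked_until = {}           # client id -> time the penalty ends

    def allow(self, client_id, now):
        """Return True if the request arriving at time `now` (seconds) is served."""
        if now < self._blocked_until.get(client_id, 0.0):
            return False  # still sitting in the penalty box
        recent = self._recent[client_id]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        recent.append(now)
        if len(recent) > self.max_requests:
            # Reloading too often looks like a bot: start the penalty.
            self._blocked_until[client_id] = now + self.penalty_seconds
            recent.clear()
            return False
        return True


box = PenaltyBox(max_requests=3, window_seconds=10, penalty_seconds=60)
bot_results = [box.allow("bot", t) for t in (0, 1, 2, 3)]
human_results = [box.allow("human", t) for t in (0, 20, 40)]
```

[In this sketch the fourth rapid request from "bot" is refused and the client stays blocked for the penalty period, while a client reloading every 20 seconds is never penalized.]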

Speaker 1:

Presumably, Jeff, that was before the cloud services had their scalable database architecture, Azure SQL or...

Speaker 2:

Those won't typically scale to the size we would need. Oh really? Okay. Yeah, at the time we would self-host. Even if we were running on AWS or Azure or wherever, we would be self-hosting. We'd run on our own virtual machines. We would run our database. Best Buy runs both in the cloud and in the data center, at that time.

Speaker 1:

So how do you go from Oracle to Cockroach? Was Oracle too expensive? Didn't scale well? What did you find?

Speaker 2:

Best Buy had been and still is a large user of Cassandra, which is a NoSQL, eventually consistent database. It's a pretty good distributed database. It has a number of limitations which are challenging from an operational perspective. They're also challenging from a programmer's perspective, in that you can only have one index. It is actually eventually consistent, so it's eventually consistent. Like, a programmer asked me, well, when will it become consistent? I would say, eventually. That could be one second, it could be 10 seconds, it could be three days. I was tasked more or less with finding a distributed database like Cassandra that would be an ACID SQL solution.

Speaker 2:

Most e-commerce sites have a desire, an aspirational or actual goal, to survive, we'll call it, a regional failure of a cloud provider, such that, like, if AWS East goes down, all the workload goes west and we stay up and running. With something like Oracle, from an operational perspective, you have normally a primary node or a primary group of nodes. All of the workload needs to go there. If there's a problem in that one or more set of servers, typically most sites would do a controlled manual failover. A lot of sites would not trust automation to do this, to make the right decision, because if the automation screws up, it can make it far worse. You see this with companies having disaster recovery tests. They'll do a DR test where they want to do a failover. That's almost always revolving around the database, getting it to move workload from site A to site B and getting everything glued back and forth and everything up and running. That's the ugly part.

Speaker 2:

Where Cockroach came in is I was trying to find a solution for that in our cloud providers. I'm very familiar with clustering and shared storage, SANs and NAS, and you need some shared storage and replication. I knew what, we'll call it, angst that would create. I wanted to avoid this angst. I started looking around for an alternative solution. I came across this Cockroach database. I set it up on my Mac as a couple of containers and set up a virtual cluster. It took me about an hour to get going. Then I started poking at it, doing bad things to it, and it kept running. It took me about two or three hours and I thought, well, this is it, this is the solution. We went from there.

Speaker 1:

Do you really need to concern yourself with disk resilience? Do you really need to run on anything like a RAID 10, RAID 5? Because you're distributed, do you have to invest also in that SAN environment?

Speaker 2:

No, we don't need a SAN. We don't need a NAS, although some customers, for their self-hosted environments, will choose to do that. In the cloud providers, we're just using standard SSD block storage. We do recommend SSD over spinning disks. I don't know why you'd use spinning disks in 2023. We just anticipate a regular, we'll call it, just a generic machine, nothing particularly special.

Speaker 3:

Jeff, if you had to describe what you're doing to a seven-year-old, a seven-year-old okay, or 10.

Speaker 4:

Getting them that PS5, that's what he's doing.

Speaker 3:

Besides that, could you give me the elevator pitch or just a quick blurb or a paragraph on what you're doing?

Speaker 2:

What I personally do is I manage a team of enterprise architects that help existing customers be successful with our product. I also do that role myself, and I also fill in a role as a customer success manager where I coordinate work with the customers. I work on the customer post-sales side. I do some pre-sales, but mostly post-sales. As far as what the product does, it's essentially your data store, and we want to live up to our name, which is, we're hard to kill. We want to be hard to kill, which is where the name comes from. You want your database to be hard to kill, able to survive bad things happening to it. Databases have a bad and probably deserved reputation for being brittle, and when the database fails, almost always the application is dead. It tends to be the real weak point in the application.

Speaker 3:

I like cockroach better than dump truck. By the way, I think cockroaches.

Speaker 2:

Cockroach. It has an interesting reaction. When you tell people the name, it's very black or white. They'll say, what a horrible name, why would you call yourself that? Or, hey, that's a great name. And the positive part to it is that I've never seen anyone forget the name.

Speaker 1:

Jeff, for some reason, and maybe I have the wrong impression of Cockroach, but I'm picturing it widely distributed, almost like, if you remember, back in the late 90s, SETI, the search for extraterrestrial intelligence, and you could have 10,000 people working on a particular data set, right, and kind of the sum of all of these small parts made something big. Is that similar in concept, where the data is spread wide but not very deep, or is it a smaller number of nodes but lots of data?

Speaker 2:

That's actually configurable by the customer or the database. Our default replication factor is three. We're gonna have three replicas and I can kind of show you that here. So this is actually a console of one of the nodes in the cluster. This was actually in my basement. One interesting thing about Cockroach is all nodes are equal. Any node can handle any and every database transaction and operation. So this console here is on every node in the cluster, the same console. You would see the same information. So if I go drill in here, I have Europe and the United States. And if I drill into the US here, I've set up here in Minnesota, there's four nodes in my basement, and I've set up three nodes in AWS East 2, which is Ohio, and I've set up three nodes in East 1.

Speaker 1:

And it'll run on Windows, Unix, Linux.

Speaker 2:

Our primary platform is to run on Linux. We do release from four. We do a Windows build, and most all of our development's really done on Macs, so we do a Mac build too. Windows, we release a build, but we don't really support that in production. We lean more towards just running on Linux.

Speaker 1:

I heard you did a mainframe build too. I did do a mainframe build.

Speaker 2:

We do actually run on zLinux, on S390. And we do run on different architectures. So over here this is AMD64, and what I set up in Europe, here, these are ARM64 nodes in the EU. For our real production customers, we tend to recommend homogeneous nodes, like, we like to see all the nodes equal. It's challenging to get dissimilar performance nodes to perform well as a group, because it's hard for the underlying system to determine how much workload it can throw at one node. So we sort of assume that all the nodes are equal in capability.

Speaker 2:

I've loaded a few databases into here. So there's a TPC-C benchmark running. This is sort of a fake Uber type of company, and this is something like a bank would do. And then you'll see these nodes, regions. So if we look at bank you can see what the table is, the grants, SQL, and you can see there's replicas. All these nodes here are involved with that database. So if we go back here we'll see these are probably mostly the European nodes, because I'm hitting a European node. Over here we have the concept of follow the workload, so as activity starts hitting one area or one region, we'll start to move.

Speaker 2:

Every range has what's called a leaseholder. The lease is owned by a node, and that leaseholder is where you have to go if you're gonna do an update. So there's a leaseholder and there's replicas. The leaseholder can move around. So we'll move that leaseholder around, based on workload or if there was a failure.
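
[Editor's sketch: one way to picture "follow the workload" is a function that picks which live replica should hold the lease for a range. This is a simplified illustration and not CockroachDB's actual lease-transfer logic; the function name, node ids, region names, and traffic counts are hypothetical.]

```python
def pick_leaseholder(replicas, dead_nodes, recent_requests_by_region):
    """Pick the node that should hold the lease for one range.

    replicas: list of (node_id, region) pairs that hold a copy of the range.
    dead_nodes: set of node ids currently considered dead.
    recent_requests_by_region: dict of region -> recent request count.
    Returns a node id, or None if no replica is alive.
    """
    live = [(node, region) for node, region in replicas if node not in dead_nodes]
    if not live:
        return None
    # Prefer the live replica in the busiest region; break ties by node id
    # so the choice is deterministic.
    best = min(live, key=lambda r: (-recent_requests_by_region.get(r[1], 0), r[0]))
    return best[0]


replicas = [("n1", "us-east"), ("n5", "eu-west"), ("n9", "us-central")]
traffic = {"eu-west": 900, "us-east": 100}

follows_workload = pick_leaseholder(replicas, set(), traffic)
after_failure = pick_leaseholder(replicas, {"n5"}, traffic)
```

[With most traffic arriving in the hypothetical "eu-west" region, the lease lands on the European replica; if that node dies, it moves to the next-busiest live replica, mirroring the two triggers Jeff mentions.]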

Speaker 2:

So what I'm gonna do here, I'm gonna stop this and restart it for a longer duration, so it keeps running. So we have one region over here and I'm gonna take this node and I'm gonna nuke this node. So nine there. So now I've killed one node in the EU and right away you see this node, it went suspect because it no longer was talking, and we can see it here. It's gone suspect and now we have under-replicated ranges. So, because we have one node down and we had a survivability goal on it, the database cluster knows that there are some ranges that do not have enough replicas. Once we declare it dead, we're gonna up-replicate these under-replicated ranges, and we'll create more replicas out on all the nodes. And so again, see, our workload is still running over here. We killed one node, the database is still up and running and we don't have any issues. The application's still going, and we should hopefully see here in a minute this go to red, and then we'll start seeing it up-replicating.

Speaker 4:

And Jeff, in theory you can withstand just one less than half of nodes to fail in a given cluster and still keep things chugging along, kind of.

Speaker 2:

Yes, so you need more than half of the total number of replicas. We don't have a split-brain type of scenario. You would not want that. You want everything in an odd number, like in threes.

Speaker 4:

Yep so we could have six of them down here and it would eventually get back to its happy place.

Speaker 2:

Yes. Now, okay, so, like here, it went dead. We declared it dead and you see this number, now we're up-replicating, so this number will start driving down as we create new replicas. Now if the node comes back online, we'll recognize that internally and we'll either catch it up or we'll just re-create new replicas out there. We kind of know the rate of change on the database and which strategy is the best way to catch the node up. And turning it back on essentially is what you would do for adding more nodes.

Speaker 2:

Now, if you killed too many nodes, to your question about if I have less than 50%, those will be considered what's called unavailable ranges. We don't have quorum for that. If it's unavailable, part of your database most likely is not gonna be there, but if you were to restore the nodes, they'll come back. You don't have to go back to your backups or restores or anything like that. So I'm gonna start this node back up. Yeah, so now we see it there.
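
[Editor's sketch: the three states in the demo (healthy, under-replicated, unavailable) follow from a per-range majority rule. The snippet below is a simplified model of that rule, not CockroachDB code; the range names, node ids, and replica placements are hypothetical.]

```python
def range_status(replica_nodes, dead_nodes):
    """Classify one range from the nodes holding its replicas and the dead nodes."""
    total = len(replica_nodes)
    live = sum(1 for node in replica_nodes if node not in dead_nodes)
    quorum = total // 2 + 1  # strict majority of that range's replicas
    if live < quorum:
        return "unavailable"       # no quorum: this part of the data cannot be served
    if live < total:
        return "under-replicated"  # still serving, but needs a new replica
    return "healthy"


# Hypothetical placement with the default replication factor of three.
ranges = {
    "r1": ["n1", "n2", "n3"],
    "r2": ["n2", "n3", "n4"],
    "r3": ["n4", "n5", "n6"],
}

one_down = {name: range_status(nodes, {"n2"}) for name, nodes in ranges.items()}
two_down = {name: range_status(nodes, {"n2", "n3"}) for name, nodes in ranges.items()}
```

[Note that in this model quorum is counted per range over its own replicas, so losing two nodes can make some ranges unavailable while others on different nodes stay healthy.]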

Speaker 2:

Now the interesting thing we can do, and I can demonstrate this, is we can make what's called online schema changes.

Speaker 2:

You don't have to define, like, the survivability goals or the localization of the database when you set it up. You can change that on the fly. We have customers that are gambling companies, so they need to keep data domiciled in each of the states they are legally operating in. The customer data has to stay within the state, so they have clusters within each state they do business in, and they then can simply say, okay, you are, let's say, a customer of Florida, your data has to stay in the Florida cluster. You can do the same sort of thing with, like, I'm a multinational company, I wanna keep my EU customers in the EU database, I want my American, US customers in American clusters. And we have a concept of what we call super regions, which we added. So, like, if you have multiple regions in Europe, we'll create enough replicas and we know that those regions in Europe go together.
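
[Editor's sketch: the domiciling rule can be modelled as "a row may only have replicas in its home region, or in the other regions of the same super region." This is a toy illustration of that idea, not CockroachDB's configuration syntax or placement logic; every region name and mapping below is hypothetical.]

```python
# Hypothetical mappings, for illustration only.
HOME_REGION = {"FL": "us-fl", "NJ": "us-nj", "FR": "eu-west", "DE": "eu-central"}
SUPER_REGIONS = {"europe": frozenset({"eu-west", "eu-central"})}


def allowed_regions(jurisdiction):
    """Regions where data for a customer in this jurisdiction may live."""
    home = HOME_REGION[jurisdiction]
    for members in SUPER_REGIONS.values():
        if home in members:
            return set(members)  # regions in a super region "go together"
    return {home}


def misplaced_replicas(jurisdiction, replica_regions):
    """Return the replica regions that would violate the domiciling rule."""
    allowed = allowed_regions(jurisdiction)
    return [region for region in replica_regions if region not in allowed]


florida_check = misplaced_replicas("FL", ["us-fl", "us-fl", "us-nj"])
france_check = misplaced_replicas("FR", ["eu-west", "eu-central", "eu-west"])
```

[Here a Florida customer's replica in the hypothetical "us-nj" region is flagged, while a French customer's replicas spread across both European regions are acceptable.]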

Speaker 1:

That's a great overview, Jeff, really appreciate that. Can we pivot slightly and talk a little bit about solar? Maybe that's a wholly different episode, but I know you've done a little bit of work with solar and I'm just curious on your thoughts so far on your journey.

Speaker 4:

Jeff, real quick. Did your mind immediately go to Apache Lucene and Solr before it went to solar power?

Speaker 2:

It actually did, because when I was at Best Buy, Solr was one of the databases we did run, so for me.

Speaker 4:

I saw the wheels turning. I thought maybe that's where they were going.

Speaker 2:

Yeah, I went there right away. But yeah, I did actually have, I'm sure my wife talked about it, solar panels installed in December, of all things, this last year, which was kind of nuts. They actually shoveled the roof off and they must have spent half a day. It's been pretty good. I'm fairly pleased with it.

Speaker 1:

Was it what you expected, from the research you had done to what it was in practice?

Speaker 2:

Yeah, I was pretty much on with what I thought it would be. To some degree it's always too expensive. You're always kind of like, oh, this was expensive. But in this case it's like, okay, it doesn't raise my property taxes, it adds value to the house. So it's kind of an asset and it kind of effectively makes money every month and it's going to last, unless the house gets destroyed, 20 plus years. So it's sort of. If there's anything I regret, it's not actually putting up a few more panels on the side. That gets more shade.

Speaker 1:

You'll truly be able to be off grid, where you charge those batteries up during the day and then use them when it's dark.

Speaker 2:

Kind of my plan is. It might be to do the opposite, which is change the power meter to time of day billing. I did also buy an electric vehicle.

Speaker 1:

How do you like it?

Speaker 2:

Well, it's pretty much everything I expected. After you drive an electric vehicle, internal combustion vehicles are kind of like, this is dumb. I mean, just ignore the whole environmental, left, right, the whole political thing around it. It's like, internal combustion cars are not very good vehicles compared to an electric vehicle.

Speaker 2:

An electric vehicle, if you were just a pure, we'll call it a pure motorhead. Well, they do have electric motors in them. But if you're a pure motorhead, like, this car's got torque up the wazoo, it has no transmission, it doesn't even probably have a differential, and you just push down the accelerator and it goes like a bat out of hell. And it brakes. Well, that's another weird thing you got to get used to, is braking: you don't really brake. And then the cost, which is another thing which I think is going to be the killer behind why electric vehicles will take over everything. As they say, the best part is no part. And if you look at the number of parts differences between an electric car and an internal combustion engine car, it's like, I don't have any of these parts. I bought the car and the dealer says, well, you get a free oil change, and then they kind of jokingly said, like you need one.

Speaker 1:

Well, that's good. You've had a good experience, then, so far with the solar panels.

Speaker 2:

Yeah, the solar's worked well. I can be nerdy and watch it up to the second. Actually I have a tool that's watching the power consumption on the actual power feeds, so it's kind of entertaining. And my roof's not really optimal for solar production. On paper I have 10 kilowatts, but I've never seen anything above eight. But that's mostly the angle and the direction of the sun. Again, the roof was never really designed for it, so I take what I get. But certainly, like, May was cool, so you didn't really run the air conditioner, and our electric bill went negative for May, so I was pretty happy with that. I was getting kind of tired of paying Xcel.

Speaker 4:

Yeah, what do they pay back these days for net metering, for negative balances? They pay back what they charge you. They do? Okay, I thought it was less, maybe.

Speaker 2:

Well, you know, they tack on a bunch of fees, of course, per QA, which you don't get paid for. So it's technically what they charge you. It's kind of funny. The electrician for the company I had the panels installed with, he was jokingly saying, well, you know, Xcel's going to hate you now. And he was joking, but he was kind of serious. He says, you know, Xcel views all you solar panel owners as the enemy.

Speaker 4:

That's what all the YouTube ads I get these days tell me. Like, hey, you know what Xcel's doing to you? You can stick it to them if you just take out this big.

Speaker 2:

Yeah, well, there's some truth to that and, you know, the third back from Uncle Sam is a big factor. I'm actually having a look at putting a heat pump in, replacing our furnace with a heat pump and then a heat pump hot water heater. I don't think we'll do that again. The government is going to give you a third back, so 30% back. So it's like, well, you know, we'd kind of be dumb not to. That's the way I look at it with the solar panels, it's like, how many places could you go buy something, a box that pays 10 to 15% back a year and lasts 20, 25 years, and the government will give you 30% back for buying the box? Wouldn't your question be, how many boxes can I buy? I mean, that's kind of where we're at. Like, you know, when you look at it from a purely financial perspective, it's like this is a pretty good investment.
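
[Editor's sketch: Jeff's "box" arithmetic can be worked through as a simple payback calculation. The assumptions here are the editor's, not Jeff's: the 10 to 15% yearly return is taken against the gross purchase price, the 30% credit is received up front, and the $20,000 price is a made-up example figure.]

```python
def simple_payback_years(gross_cost, annual_return_rate, credit_rate=0.30):
    """Years for yearly savings to cover the net cost after the credit.

    Assumes the annual return is a fixed fraction of the gross cost and
    ignores financing, degradation, and changes in electricity prices.
    """
    net_cost = gross_cost * (1.0 - credit_rate)
    annual_savings = gross_cost * annual_return_rate
    return net_cost / annual_savings


example_cost = 20_000  # hypothetical system price in dollars
slow_case = simple_payback_years(example_cost, 0.10)  # roughly 7 years
fast_case = simple_payback_years(example_cost, 0.15)  # roughly 4.7 years
```

[Under these assumptions the panels pay for themselves in roughly five to seven years, well inside the 20 to 25 year lifetime Jeff mentions.]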

Speaker 1:

Really financial. You would think the best place to put it would be on open land, so that you didn't have the costs incurred on the roof, and then you could sell it back. You know three, four times over.

Speaker 2:

Well, I think it's going to be a little interesting with Minnesota. You know, they grow a lot of corn to sell as ethanol, which, I think, mathematically is a loser. Like, the amount of energy it takes to grow the corn, harvest the corn, turn it into ethanol, it took more fuel to do that, so it's actually a net loss. You get into the same question: why would I grow corn there when I could put some solar panels on that field that don't require any maintenance and don't require any planting or anything like that, and don't care if it rains or not? Why wouldn't I do that?

Speaker 1:

I thought they were using the diesel, though, essentially to use it for a generator that produces electricity.

Speaker 2:

Yes, yes, so they're diesel-electric locomotives. So the diesel is just a prime mover generating electricity. Then there's electric motors, which I tell people, if you don't think an electric motor can pull a lot, look at a locomotive. I don't know if you know this, there used to be an electric railroad that went through Idaho, over the Rockies and over the Cascades. That was created around 1905 and existed until the 1970s, and it was electrified. It was the Milwaukee Road, electrified over the Rocky Mountains, and they were electrified for efficiency and so they didn't create forest fires from steam locomotives. And again, 1905, they had a fully electric main line freight line. So I said, if someone thinks they can't do it, they did it a long time ago.

Speaker 1:

I think that's the same thing with the cars too, right? One of the first cars was all electric. Oh boy, well, cool. We covered a lot of stuff today. Jeff, thanks for coming on. Do want to thank Jeff, and Jeff, we'll send you a little something here too to say thanks for coming on. And we do a game night every so often where we play some board games like Coup and Resistance and just some other fun stuff like that. So we'll let you know the next time we play. We're right downtown St. Paul.

Speaker 2:

I'll ask Gretchen if I can be invited. She might not want me there. She likes having her tribe.

Speaker 1:

The more, the merrier I say.

Speaker 4:

We need a new place, though. Eric Allery is shut down now, so we're shopping for a new spot.

Speaker 1:

I think we could go to the office now. We're just about ready.

Speaker 4:

There you go.

Speaker 1:

IT Audit Labs is a team of professionals that assess security, risk and compliance. Our threat assessments find the soft spots before the bad guys do. Whether you are looking for a point solution or a broader security program, contact IT Audit Labs to reduce your organizational risk. Thanks to our producer, Joshua J Schmidt, and our audio video editor, Cameron Troy Hill.

Exploring CockroachDB
Understanding CockroachDB's Distributed Data Replication
Solar Power and Electric Vehicles
Thanking Jeff and Discussing Game Nights