Another must-read post this month for many of you was Transparent SLIs: See Google Cloud the way your application experiences it, announcing the availability of detailed data insights on GCP services that your workloads use—helping you see like a Google site reliability engineer (SRE). These new service-level indicators (SLIs) go way beyond basic uptime and downtime to delve into response codes, latency and more. You can then separate out metrics by GCP service to see things like API version, location and protocol. The result is that you can filter and sort to get extremely fine-grained information on your software and the GCP services you use, which helps cut resolution times and improve the support experience. Transparent SLIs are available now through the Stackdriver monitoring console. Learn more here about the basics of using SLIs and other SRE tools to measure and manage availability.
It’s also now faster and easier to find production-ready commercial Kubernetes apps in the GCP Marketplace. These apps are prepackaged and configured to get up and running easily, whether on Kubernetes Engine or other Kubernetes clusters, and run the gamut from security, data analytics and developer tools to storage, machine learning and monitoring.
There was obviously a lot to talk about at the show, and you can get even more detail on what happened at Next ‘18 here.
Building the cloud back-end
For all of you developing cloud apps with Java, the availability of Jib was an exciting announcement last month. This open-source container image builder, available as Gradle and Maven plugins, cuts out several steps from the Docker build flow. Jib does all the work required to package your app into a container image—you don’t need to write a Dockerfile or even have Docker installed. You end up with faster builds and reproducible container images.
And on that topic, this best practices for building containers post was a hit, too, giving you tips that will set you up to run your environment more smoothly. The tips in this blog post cover graceful application shutdowns, how to simplify containers and how to choose and tag the container images you’ll use.
It’s been a busy month at GCP, and we’re glad to share lots of new tools with you. Till next time, build away!
Of course, there's a fair bit of missing glue in the above diagram. For example, how do we take an UpdateReadings command and get it into Bigtable? This is where my favorite serverless service comes into play: Cloud Functions! How do we install devices? Cloud Functions. How do we create organizations? Cloud Functions. How do we access data through an API? Cloud Functions. How do we conquer the world? Cloud Functions. Yep, I’m in love!
Alright, now that we have our baseline, let’s spend the rest of this post exploring just how we go about implementing each part of our architecture and its dataflows.
First step: inbound data
Our smart city platform is nothing more than a distributed network of internet-connected (IoT) devices. These devices are composed of one or more sensors that capture readings and their designated gateways that help package this data and send it through to our cloud.
For example, we may have an in-ground sensor used to detect a parked car. This sensor reports IR and magnetic readings that are transferred over RF (radio frequency) to a nearby gateway. Another example is a smart trash can that monitors capacity and broadcasts when the bin is full.
The challenges of IoT-based systems have always been data collection, in-field device updates, and security. We could write an entire series of articles on how to deal with these challenges. In fact, the burden of this task is the reason we haven’t seen many sophisticated, generic IoT platforms. But not anymore! The problem has been solved for us by those wonderful engineers at Google. Cloud IoT Core is a serverless service offered by Google Cloud Platform (GCP) that helps you skip all the annoying steps. It’s like jumping on top of the bricks!
Wait . . . does anyone get that reference anymore? Mario Brothers. The video game for the NES. You could jump on top of the ceiling of bricks to reach a secret pipe that let you skip a bunch of levels. It was a pipe because you were a plumber . . . fighting a turtle dragon to save a princess. And you could throw fireballs by eating flowers. Trust me, it made perfect sense!
Anyway! Cloud IoT Core is the secret passage that lets you skip a bunch of levels and get to the good stuff. It scales automatically and is simple to use. Seriously, don’t spend any time managing your devices and securing your streams. Let Google do that for you.
So, sensors are observing life in our city and streaming this data to IoT Core. Where does it end up after IoT Core? In Cloud Pub/Sub, Google’s serverless queuing service. Think of it as a globally distributed subscription queue with guaranteed delivery. The result: our vast network of data streams has been converted to a series of queues that our services can subscribe to. This is our inbound pipeline. It scales nearly infinitely and requires no operational support. Think about that. We haven’t even written any code yet and we already have an incredibly robust architecture. Trust me, it took my team only a week to move our existing device onramp over to IoT Core—it’s that straightforward. And how many problems have we had? How many calls at 3 AM to fix the inbound data? Zero. They should call it opsless rather than serverless!
Anyway, we got our data streaming in. So far, our architecture looks like this:
While we’re exploring a smart city platform made from IoT devices, you can use this pipeline with almost any architecture. Just replace the IoT Core box with your onboarding service and output to Pub/Sub. If you still want that joy of serverless (and no calls at 3 AM), then consider using Google Cloud Dataflow as your onramp!
What is Dataflow? It's a serverless implementation of a Hadoop-like pipeline used for transforming and enriching streaming or batch data. Sounds really fancy, and it actually is. If you want to know just how fancy, grab any data engineer and ask for their war stories on setting up and maintaining a Hadoop cluster (it might take a while; bring popcorn). In our architecture, it can be used to onramp data from an external source, to help with the efficient formation of aggregates (i.e., to MapReduce a large number of events), or to help with windowing for streaming data. This is huge. If you know anything about streaming data, then you’ll know the value of a powerful, flexible windowing service.
Ok, now that we’ve got streaming data, let’s do something with it!
Second step: normalizing streaming data
How is a trash can like a street lamp? How is a parking sensor like a barometer? How is a phaser like a lightsaber? These are questions about normalization. We have a lot of streaming data, but how do we find a common way of correlating it all?
For IoT this is a complex topic, and more information can be found in this whitepaper. Of course, the only whitepaper in most developers’ lives comes on a roll. So here is a quick summary:
How do we normalize the data streaming from our distributed devices? By converting them all to geolocated events. If we know the time and location of a sensor reading, we can start to colocate and correlate events that can lead to action. In other words, we use location and time to help us build a common reference point for everything going on in our city.
Fortunately, many (if not all) devices will already need some form of decoding / translation. For example, consider our in-ground parking sensor. Since it's transmitting data over radio frequencies, it must optimize and encode data. Decoding could happen in the gateway, but we prefer a gateway to contain no knowledge of the devices it services. It should just act as the doorway to the World Wide Web (for all the Generation Z folks out there, that’s what the "www." in URLs stands for).
Ideally, devices would all natively speak "smart city" and no decoding or normalization would be needed. Until then, we still need to create this step. Fortunately, it is super simple with Cloud Functions.
Cloud Functions is a serverless compute offering from Google. It allows us to run a chunk of code whenever a trigger occurs. We simply supply the recipe and identify the trigger, and Google handles all the scaling and compute resource allocation. In other words, all I need to do is write the 20-50 lines of code that make my service unique and never worry about ops. Pretty sweet, huh?
So, what’s our trigger? A Pub/Sub topic. What’s our code? Something like this:
function decodeParkingReadings(triggerInput) {
  // Parse the Pub/Sub message, decode the sensor payload, wrap it in an
  // UpdateReadings command, and publish it to the DeviceCommands topic.
  return parsePubSubMessage(triggerInput)
    .then(decode)
    .then(convertToCommand)
    .then((command) => PubSub.topic('DeviceCommands').publish(command));
}
If you’re not familiar with promises and async coding in JavaScript, the above code simply does the following:
Parse the message from Pub/Sub
When this is done, decode the payload byte string sent by the sensor
When this is done, wrap the decoded readings with our normalized UpdateReadings command data struct
When this is done, send the normalized event to the Device Readings Pub/Sub
Of course, you’ll need to write the code for the "decode" and "convertToCommand" functions. If there's no timestamp provided by the device, then it would need to be added in one of these two steps. We’ll get more in-depth into code examples in part three.
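As a rough preview (the UpdateReadings command shape and field names here are assumptions, not the actual schema), convertToCommand might look something like this:
// Hypothetical sketch of convertToCommand; the command shape is illustrative only.
function convertToCommand(readings) {
  // Wrap decoded sensor readings in a normalized UpdateReadings command.
  return {
    type: 'command',
    name: 'UpdateReadings',
    deviceId: readings.deviceId,
    // Add a timestamp here if the device didn't provide one.
    timestamp: readings.timestamp || new Date().toISOString(),
    payload: readings,
  };
}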
So, in summary, the second step is to normalize all our streams by converting them into commands. In this case, all sensors are sending in a command to UpdateReadings for their associated device. Why didn’t we just create the event? Why bother making a command? Remember, this is an event-driven architecture. This means that events can only be created as a result of a command. Is it nitpicky? Very. But is it necessary? Yes. By not breaking the command -> event -> command chain, we make a system that's easy to expand and test. Without it, you can easily get lost trying to track data through the system (yes, a lot more on tracking data flows later).
So our architecture now looks like this:
Data streams coming into our platform are decoded using bespoke Cloud Functions that output a normalized, timestamped command. So far, we’ve only had to write about 30 - 40 lines of code, and the best part . . . we’re almost halfway complete with our entire platform.
Now that we have commands, we move onto the real magic. . . storage. Wait, storage is magic?
Third step: storage and indexing events
Now that we've converted all our inbound data into a sequence of commands, we’re 100% into event-driven architecture. This means that now we need to address the challenges of this paradigm. What makes event-driven architecture so great? It makes sense and is super easy to extend. What makes event-driven architecture painful? Doing it right has been a pain. Why? Because you only have commands and events in your system. If you want something more meaningful you need to aggregate these events. What does that mean? Let’s consider a simple example.
Let’s say you’ve got an event-driven architecture for a website that sells t-shirts. The orders come in as commands from a user-submitted web form. Updates also come in as commands. On the backend, we store only the events. So consider the following event sequence for a single online order:
You cannot get an accurate view of the current order by looking at only one event. If you only looked at #2 (Address Changed), you wouldn’t know the item quantities. If you only looked at #3 (Quantity Change), you wouldn’t have the address.
To get an accurate view, you need to "replay" all the events for the order. In event-driven architecture, this process is often referred to as "hydrating." Alternatively, you can maintain an aggregate view of the order (the current state of the order) in the database and update it whenever a new command arrives. Both of these methods are correct. In fact, many event-driven architectures use both hydration and aggregates.
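As a quick illustration, hydration is just replaying an order's events into a current state. A minimal sketch, with hypothetical event names and shapes:
// Hypothetical hydration sketch: fold the order's events, oldest first, into one state.
function hydrateOrder(events) {
  return events.reduce((order, event) => {
    switch (event.type) {
      case 'OrderCreated':
        return { ...order, items: event.items, address: event.address };
      case 'AddressChanged':
        return { ...order, address: event.address };
      case 'QuantityChanged':
        return { ...order, items: event.items };
      default:
        return order; // Ignore event types we don't care about.
    }
  }, {});
}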
Unfortunately, implementing consistent hydration and/or aggregation isn’t easy. There are libraries and even databases designed to handle this, but that was before the wondrous powers of serverless computing. Enter Google Cloud Bigtable and BigQuery.
Bigtable is the database service that Google uses to index and search the entire internet. Let that sink in. When you do a Google search, it's Bigtable that gets you the data you need in the blink of an eye. What does that mean? Unmatched power! We’re talking about a database optimized to handle billions of rows with millions of columns. Why is that so important? Because it lets us do event-driven architecture right!
For every command we receive, we create a corresponding event that we store in Bigtable.
Wow. This blogger bloke just told us that we store data in a database. Genius.
Why thank you! But honestly, it's the aspects of the database that matters. This isn’t just any database. Bigtable lets us optimize without optimizing. What does that mean? We can store everything and anything and access it with speed. We don’t need to write code to optimize our storage or build clever abstractions. We just store the data and retrieve it so fast that we can aggregate and interpret at access.
Huh?
Let me give you an example that might help explain the sheer joy of having a database that you cannot outrun.
These days, processors are fast. So fast that the slowest part of computing is loading data from the disk (even SSD). Therefore, most of the world’s most performant systems use aggressive compression when interacting with storage. This means that we'll compress all writes going to disk to reduce read-time. Why? Because we have so much excess processing power, it's faster to decompress data rather than read more bytes from disk. Go back 20 years and tell developers that we would "waste time" by compressing everything going to disk and they would tell you that you’re mad. You’ll never get good performance if you have to decompress everything coming off the disk!
In fact, most platforms go a step further and use the excessive processing power to also encrypt all data going to disk. Google does this. That’s why all your data is secure at rest in their cloud. Everything written to disk is compressed AND encrypted. That goes for their data too, so you know this isn’t impacting performance.
Bigtable is very much the same thing for web services. Querying data is so fast that we can perform processing post-query. Previously, we would optimize our data models and index the heck out of our tables just to reduce query time inside the database. When the database can query across billions of rows in 6 ms, that changes everything. Now we just store and store and store and process later.
This is why storing your data in a database is amazing, if it's the right database!
So how do we deal with aggregation and hydration? We don’t. At least not initially. We simply accept our command, load any auxiliary lookups needed (often the sensor / device won’t be smart enough to know its owner or location), and then save it to Bigtable. Again, we’re getting a lot of power with very little code and effort.
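As a rough sketch (the instance name, table name, column family, and rowkey layout are all assumptions, not the actual schema), saving the resulting event might look like this:
// Minimal sketch: persist a ReadingsUpdated event to Bigtable.
const { Bigtable } = require('@google-cloud/bigtable');
const table = new Bigtable().instance('smart-city').table('events');

function storeReadingsUpdated(command) {
  return table.insert({
    // Rowkey: device ID plus timestamp, so a device's events sort together.
    key: `${command.deviceId}#${command.timestamp}`,
    data: {
      event: {
        type: 'ReadingsUpdated',
        payload: JSON.stringify(command.payload),
      },
    },
  });
}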
But wait, there’s more! I also mentioned BigQuery. This is the service that allows us to run complex SQL queries across massive datasets (even datasets stored in a non-SQL database). In other words, now that I’ve stored all this data, how do I get meaning from it? You could write a custom service, or just use BigQuery. It will let you perform queries and aggregations across terabytes of data in seconds.
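For instance, a query over the stored events might look something like this (the dataset, table, and column names are assumptions):
// Minimal sketch: aggregate event counts per device with BigQuery.
const { BigQuery } = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

async function readingsPerDeviceLastDay() {
  const [rows] = await bigquery.query({
    query: `
      SELECT deviceId, COUNT(*) AS readings
      FROM \`smart_city.facts\`
      WHERE created > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
      GROUP BY deviceId
      ORDER BY readings DESC`,
  });
  return rows;
}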
So yes, for most of you, this could very well be the end of the architecture:
Seriously, that’s it. You could build any modern web service (social media, music streaming, email) using this architecture. It will scale infinitely and have a max response time for queries of around 3 - 4 seconds. You would only need to write the initial normalization and the required SQL queries for BigQuery. If you wanted even faster responses, you could target queries directly at Bigtable, but that requires a little more effort.
This is why storage is magic. Pick the right storage and you can literally build your entire web service in two weeks and never worry about scale, ops, or performance. Bigtable is my Patronus!!
Now we could stop here. Literally. We could make a meaningful and useful city platform with just this architecture. We’d be able to make meaningful reports and views on events happening throughout our city. However, we want more! Our goal is to make a smart city, one that automatically reacts to events.
Fourth step: temporal reasoning
Ok, this is where things start to get a little more complex. We have events—a lot of events. They are stored, timestamped and geolocated. We can query this data easily and efficiently. However, we want to make our system react.
This is where temporal reasoning comes in. The fundamental idea: we build business rules and insights based on the temporal relationship between grouped events. These collections of related events are commonly referred to as "intervals" or "frames." For example, if my interval is a lecture, the lecture itself can contain many smaller events:
We can also take the lecture in the context of an even larger interval, such as a work day:
And of course, these days can be held in the context of an even larger frame, like a work week.
Once we've built these frames (these collections of events), we can start asking meaningful questions. For example, "Has the average temperature been above the danger threshold for more than 5 minutes?", "Was maintenance scheduled before the spike in traffic?", "Did the vehicle depart before the parking time limit?"
For many of you, this process may sound familiar. This approach of applying business rules to streaming data has many similarities to a Complex Event Processing (CEP) service. In fact, a wonderful implementation of a CEP that uses temporal reasoning is the Drools Fusion module. Amazing stuff! Why not just use a CEP? Unfortunately, business rule management systems (BRMS) and CEPs haven't yet fully embraced the smaller, bite-size methodologies of microservices. Most of these systems require a single monolithic instance that demands absolute data access. What we need is a distributed collection of rules that can be easily referenced and applied by a distributed set of autoscaling workers.
Fortunately, writing the rules and applying the logic is easy once you have the grouped events. Creating and extending these intervals is the tricky part. For our smart city platform, this means having modules that define specific types of intervals and then add any and all related events.
For example, consider a parking module. This would take the readings from sensors that detect the arrival and departure of a vehicle and create a larger parking interval. An example of a parking interval might be:
We simply build a microservice that listens to ReadingsUpdated events and manages creation and extension of parking intervals. Then, we're free to make a service that reacts to the FrameUpdated events and runs temporal reasoning rules to see if a new command should be created. For example, "If there's no departure 60 minutes after arrival, broadcast an IssueTicket command."
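Written out as a hedged sketch (the interval fields and the publishCommand helper are hypothetical), that rule might look like:
// Hypothetical temporal rule for a parking interval.
// interval.arrival / interval.departure and publishCommand are assumptions.
function checkParkingRules(interval, now = Date.now()) {
  const SIXTY_MINUTES = 60 * 60 * 1000;
  const overstayed =
    interval.arrival && !interval.departure && now - interval.arrival > SIXTY_MINUTES;

  if (overstayed) {
    // Broadcast an IssueTicket command for downstream services to react to.
    return publishCommand('IssueTicket', { intervalId: interval.id });
  }
  return Promise.resolve(); // Nothing to do for this interval yet.
}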
Of course, we may need to correlate events into an interval that are outside the scope of the initial sensor. In the parking example, we see "payment made." Payment is clearly not collected by the parking sensor. How do we manage this? By creating links between the interval and all known associated entities. Then, whenever a new event enters our system, we can add it to all related intervals (if the associated producer or its assigned groups are related). This sounds complex, but it's actually rather easy to maintain a complex set of linkages in Bigtable. Google does this on a significant scale (like the entire internet). Of course, this would be a lot simpler if someone provided a serverless graph database!
So, without diving too much into the complexities, we have the final piece of our architecture. We collect events into common groups (intervals), maintain a list of links to related entities (for updates), and apply simple temporal reasoning (business rules) to drive system behavior. Again, this would be a nightmare without using an event-driven architecture built on serverless computing. In fact, once we get a serverless graph database and distributed BRMS, we've solved the internet (spoiler alert: we'll all change into data engineers, AI trainers and UI guys).
While everything sounds easy, there are a few details I may have glossed over. I hope you found some! You’re an engineer, of course you did. Sorry, but this part is a little technical. You can skip it if you like! Or just email me your doubts and I’ll reply!
A quick example: how do I query a subset of devices? We have everything stored in Bigtable, but how do I look at only one group? For example, what if I only wanted to look at devices or events downtown?
This is where grouping comes in. It’s actually really easy with Bigtable. Since Bigtable is NoSQL, we can have sparse columns: a family called "groups" with any custom set of column qualifiers in that family per row. In other words, we let an event belong to any number of groups. We look up the current groups when the command is received for the device and add the appropriate columns. This will hopefully make more sense when we go deeper in part three.
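In the meantime, here's a rough sketch of what writing those sparse group columns could look like (the column family, qualifiers, and rowkey format are assumptions):
// Hypothetical sketch: tag an event row with its groups as sparse columns.
const { Bigtable } = require('@google-cloud/bigtable');
const table = new Bigtable().instance('smart-city').table('events');

function storeEventWithGroups(event, groups) {
  const groupColumns = {};
  groups.forEach((g) => { groupColumns[g] = '1'; }); // e.g. { downtown: '1', parking: '1' }

  return table.insert({
    key: `${event.deviceId}#${event.timestamp}`,
    data: {
      readings: { payload: JSON.stringify(event.payload) },
      groups: groupColumns, // sparse: only the groups this event belongs to
    },
  });
}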
Another area worth a passing mention is extension and testing. Why is serverless and event-driven architecture so easy to test and extend? The ease of testing comes from the microservices. Each component does one thing and does it well. It accepts either a command or an event and produces a simple output. For example, each of our event Pub/Subs has a Cloud Function that simply takes the events and stores them in Google Cloud Storage for archival purposes. This function is only 20 lines of code (mostly boilerplate) and has absolutely no impact on the performance of other parts of the system. Why no impact? It's serverless, meaning that it autoscales only for its needs. Also, thanks to the Pub/Sub queues, our microservice is taking a non-destructive replication of input (i.e., each microservice is getting a copy of the message without putting a burden on any other part of our architecture).
This zero impact is also why extension of our architecture is easy. If we want to build an entirely new subsystem, we simply branch off one of the Pub/Subs. This means a developer can rebuild the entire system if they want with zero impact and zero downtime for the existing system. I've done this [transitioned an entire architecture from Datastore to Bigtable1], and it's liberating. Finally, we can rebuild and refactor our services without having to toss out the core of our architecture—the events. In fact, since the heart of our system is events published through serverless queues, we can branch our system just like many developers branch their code in modern version control systems (i.e., Git). We simply create new ways to react to commands and events. This is perfect for introducing new team members. These noobs [technical term for a new starter] can branch off a Pub/Sub and deploy code to the production environment on their first day with zero risk of disrupting the existing system. That's powerful stuff! No-fear coding? Some dreams do come true.
BUT—and this is a big one (fortunately, I like big buts)—what about integration testing? Building and testing microservices is easy. Hosting them on serverless is easy. But how do we monitor them and, more importantly, how do we perform integration testing on this chain of independent functions? Fortunately, that's what Part Three is for. We'll cover this all in great detail there.
Conclusion
In this post, we went deep into how we can make an event-driven architecture work on serverless through the context of a smart city platform. Phew. That was a lot. Hope it all made sense (if not, drop me an email or leave a comment). In summary, modern serverless cloud services allow us to easily build powerful systems. By leveraging autoscaling storage, compute and queuing services, we can make a system that outpaces any demand and provides world-class performance. Furthermore, these systems (if designed correctly) can be easy to create, maintain and extend. Once you go serverless, you'll never go back! Why? Because it's just fun!
In the next part, we'll go even deeper and look at the code required to make all this stuff work.
1 A little more context on the refactor, for those who care to know. Google Datastore is a brilliant and extremely cost-efficient database. It is NoSQL like Bigtable but offers traditional-style indexing. For most teams, Datastore will be a magic bullet (solving all your scalability and throughput needs). It’s also ridiculously easy to use. However, as your data sets (especially for streaming) start to grow, you’ll find that the raw power of Bigtable cannot be denied. In fact, Datastore is built on Bigtable. Still, for most of us, Datastore will be everything we could want in a database (fast, easy and cheap, with infanite2 scaling).
2 Did I put a footnote in a footnote? Yes. Does that make it a toenote? Definitely. Is ‘infanite’ a word? Sort of. In·fa·nite (adjective) - practically infinite. Google’s serverless offerings are infanite, meaning that you’ll never hit the limit until your service goes galactic.
We can then scan and select rows by using these keys. Sounds simple enough. However, we can make these rowkeys compound indexes. That means that we carry multiple pieces of information within a single rowkey. For example, what if we had three kinds of users: admin, customer and employee? We can put this information in the rowkey. For example:
(Note: We're using # to delineate parts of our rowkey, but you can use any special character you want.)
Now we can query for user type easily. In other words, I can easily fetch all "admin" user rows by doing a prefix search (i.e., find all rows that start with "admin#"). We can get really fancy with our rowkeys too. For example, we can store user messages using something like:
However, we cannot search for the latest 10 messages by Brian using rowkeys. Also, there's no easy way to get a series of related messages in order. Maybe I need a unique conversation ID that I put at the start of each rowkey? Maybe.
Determining the right rowkeys is paramount to using Bigtable effectively. However, you'll need to watch out for hotspotting (a topic not covered in this blog post). Also, any Bigtablians out there will be very upset with me because my examples don't show column families. Yeah, Bigtable must have the best holidays, because everything is about families.
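To make the prefix-search idea concrete, here's a minimal sketch (instance and table names are assumptions):
// Minimal sketch: fetch all rows whose rowkey starts with "admin#".
const { Bigtable } = require('@google-cloud/bigtable');
const table = new Bigtable().instance('smart-city').table('users');

function getAdminRows() {
  return new Promise((resolve, reject) => {
    const rows = [];
    table.createReadStream({ prefix: 'admin#' })
      .on('data', (row) => rows.push(row))
      .on('error', reject)
      .on('end', () => resolve(rows));
  });
}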
So, we can efficiently search our rows using rowkeys, but this may seem very limited. Who could design a single rowkey that covers every possible query? You can't. This is where the second major concept comes in: data mitosis.
What is data mitosis? It's replication of data into multiple tables that are optimized for specific queries. What? I'm replicating data just to overcome indexing limits? This is madness. NO! THIS. IS. SERVERLESS!
While it might sound insane, storage is cheap. In fact, storage is so cheap, we'd be naive not to abuse it. This means that we shouldn't be afraid to store our data as many times as we want simply to improve overall access. Bigtable works efficiently with billions of rows. So go ahead and have billions of rows. Don't worry about capacity or maintaining a monstrous data cluster; Google does that for you. This is the power of serverless. I can do things that weren't possible before. I can take a single record and store it ten (or even a hundred) times just to make data sets optimized for specific usages (i.e., for specific queries).
To be honest, this is the true power of serverless. To quote myself, storage is magic!
So, if I needed to access all messages in my system for analytics, why not make another view of the same data:
Of course, data mitosis means you have insanely fast access to data, but it isn't without cost. You need to be careful in how you update data. Imagine the bookkeeping nightmare of trying to manage synchronized updates across dozens of data replicants. In most cases, the solution is never updating rows (only adding them). This is why event-driven architectures are ideal for Bigtable. That said, no database is perfect for all problems. That's why it's great that I can have SQL, NoSQL, and HBase databases all running for minimal costs (with no maintenance) using serverless! Why use only one database? Use them all!
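As a rough illustration of data mitosis (the table names and rowkey layouts here are purely illustrative), a single message write might fan out into several query-optimized views:
// Hypothetical sketch: write one message into multiple query-optimized tables.
const { Bigtable } = require('@google-cloud/bigtable');
const instance = new Bigtable().instance('smart-city');

function storeMessage(msg) {
  const data = { content: { body: JSON.stringify(msg) } };
  return Promise.all([
    // View optimized for "all messages from one user".
    instance.table('messages-by-user')
      .insert({ key: `${msg.userId}#${msg.timestamp}`, data }),
    // View optimized for "all messages in one conversation, in order".
    instance.table('messages-by-conversation')
      .insert({ key: `${msg.conversationId}#${msg.timestamp}`, data }),
  ]);
}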
Exporting to BigQuery
In the previous section we learned about the modern data storage model: store everything in the right database and store it multiple times. It sounds wonderful, but how do we run queries that transcend this eclectic set of data sources? The answer is. . . we cheat. BigQuery is cheating. I cannot think of any other way of describing the service. It's simply unfair. You know that room in your house (or maybe in your garage)? That place where you store EVERYTHING—all the stuff you never touch but don't want to toss out? Imagine if you had a service that could search through all the crap and instantly find what you're looking for. Wouldn't that be nice? That's BigQuery. If it existed IRL . . . it would save marriages. It's that good.
By using BigQuery, we can scale our searches across massive data sets and get results in seconds. Seriously. All we need to do is make our data accessible. Fortunately, BigQuery already has a bunch of onramps available (including pulling your existing data from Google Cloud Storage, CSVs, JSON, or Bigtable), but let's assume we need something custom. How do you do this? By streaming the data directly into BigQuery! Again, we're going to replicate our data into another place just for convenience. I would've never considered this until serverless made it cheap and easy.
In our architecture, this is almost too easy. We simply add a Cloud Function that listens to all our events and streams them into BigQuery. Just subscribe to the Pub/Sub topics and push. It’s so simple. Here's the code:
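A minimal sketch of what such an export function could look like (the dataset and table names are hypothetical, and error handling is trimmed):
// Hypothetical sketch: stream each Pub/Sub event into BigQuery.
const { BigQuery } = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

exports.exportEventsToBigQuery = (event, callback) => {
  // Pub/Sub delivers the payload base64-encoded.
  const fact = JSON.parse(Buffer.from(event.data.data, 'base64').toString());

  bigquery
    .dataset('smart_city')
    .table('facts')
    .insert([fact]) // streaming insert of a single row
    .then(() => callback())
    .catch(callback);
};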
That's it! Did you think it was going to be a lot of code? These are Cloud Functions. They should be under 100 lines of code. In fact, they should be under 40. With a bit of boilerplate, we can make this one line of code:
exports.exportStuffToBigQuery = function exportStuffToBigQuery(event, callback) {
  // The boilerplate harness parses the trigger, then hands the parsed data to sendOutput.
  myFunction.run(event, callback, (data) => myFunction.sendOutput(data));
};
Ok, but what is the boilerplate code? More on that in the next section. This section is short, as it should be. Honestly, getting data into BigQuery is easy. Google has provided a lot of input hooks and keeps adding more. Once you have the data in there (regardless of size), you can just run the standard SQL queries you all know and loathe... I mean love. Up, up, down, down, left, right, left, right, B, A!
Cloud Function boilerplate
Cloud Functions use Node? Cloud Functions are JavaScript? Gag! Yes, that was my initial reaction. Now (9 months later), I never want to write anything else in my career. Why? Because Cloud Functions are simple. They are tiny. You don't need a big robust programming language when all you're doing is one thing. In fact, this is a case where less is more. Keep it simple! If your Cloud Function is too complex, break it apart.
Of course, there is a sequence of steps that we do in every Cloud Function:
The only thing we should be writing is step 2 (and sometimes step 5). This is where boilerplate code comes in. I like my code like I like my wine: DRY! [DRY = Don't Repeat Yourself, btw].
So write the code to parse your triggers and send your outputs once. There are more steps! The actual sequence of steps for a Cloud Function is:
1) Filter unneeded events
2) Log start
3) Parse trigger
4) Do stuff
5) Send output(s)
6) Issue retry for timeout errors
7) Catch and log all fatal errors (no retry)
8) Issue callback
9) Do it all asynchronously
10) Test above code
11) Create environment resources (if needed)
12) Deploy code
Ugh. So our simple Cloud Functions just became a giant list of steps. It sounds painful, but it can all be overcome with some boilerplate code and an understanding of how Cloud Functions work at a larger level.
How do we do this? By adding a common configuration for each Cloud Function that can be used to drive testing, deployment and common behaviour. All our Cloud Functions start with a block like this:
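As a hypothetical sketch only (the field names and shapes are assumptions, not the team's actual harness), such a block could look like:
// Hypothetical configuration sketch; field names are assumptions.
// A shared harness can use this to deploy the function, create resources,
// wire up logging, and filter incoming messages.
const config = {
  name: 'decodeParkingReadings',
  trigger: { type: 'pubsub', topic: 'ParkingSensorReadings' },
  targets: [{ type: 'pubsub', topic: 'DeviceCommands' }],
  filter: (fact) => fact.subtype === 'ParkingReading',
};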
It may seem basic, but this understanding of Cloud Functions allows us to create a harness that can perform all of the above steps. We can deploy a Cloud Function if we know its trigger and its type. Since everything is inside GCP, we can easily create resources if we know our output targets and their types. We can perform efficient logging and track data through our system by knowing the start and end point (triggers and targets) for each function. The filter allows us to limit which events arriving in a Pub/Sub topic are handled.
So, what's the takeaway for this section? Make sure you understand Cloud Functions fully. See them as tiny connectors between a given input and target output (preferably only one). Use this to make boilerplate code. Each Cloud Function should contain a configuration and only the lines of code that make it unique. It may seem like a lot of work, but making a generic methodology for handling Cloud Functions will liberate you and your code. You'll get addicted and find yourself sheepishly saying, "Yeah, I kinda like JavaScript, and you know . . .Node" (imagine that!)
Testing
We can't end this blog without a quick talk on testing. Now let me be completely honest. I HATED testing for most of my career. I'm flawless, so why would I want to write tests? I know, I know . . . even a diamond needs to be polished every once in a while.
That said, now I love testing. Why? Because testing Cloud Functions is super easy. Seriously. Just use Ava and Sinon and "BAM". . . sorted. It really couldn't be simpler. In fact, I wouldn't mind writing another series of posts on just testing Cloud Functions (a blog on testing, who'd read that?).
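As a hedged sketch of what that can look like (the function under test and the makeDecodeParkingReadings injection helper are hypothetical):
// Hypothetical unit test sketch using Ava and Sinon.
const test = require('ava');
const sinon = require('sinon');

test('publishes a DeviceCommands message for a valid reading', async (t) => {
  const publish = sinon.stub().resolves();
  // Inject a fake Pub/Sub client so no network calls are made.
  // makeDecodeParkingReadings is a hypothetical factory for the function under test.
  const decodeParkingReadings = makeDecodeParkingReadings({
    topic: () => ({ publish }),
  });

  const payload = Buffer.from(JSON.stringify({ ir: 0.8 })).toString('base64');
  await decodeParkingReadings({ data: { data: payload } });

  t.true(publish.calledOnce);
});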
Of course, you don't need to follow my example. Those amazing engineers at Google already have examples for almost every possible subsystem. Just take a look at their Node examples on GitHub for Cloud Functions: https://github.com/GoogleCloudPlatform/nodejs-docs-samples/tree/master/functions [hint: look in the test folders].
For many of you, this will be very familiar. What might be new is integration testing across microservices. Again, this could be an entire series of articles, but I can provide a few quick tips here.
First, use Google's emulators. They have them for just about everything (Pub/Sub, Datastore, Bigtable, Cloud Functions). Getting them set up is easy. Getting them to all work together isn't super simple, but not too hard. Again, we can leverage our Cloud Function configuration (seen in the previous section), to help drive emulation.
Second, use monitoring to help design integration testing. What is good monitoring if not a constant integration test? Think about how you would monitor your distributed microservices and how you'd look at various data points to look for slowness or errors. For example, I'd probably like to monitor the average time it takes for a single input to propagate across my architecture and send alerts if we slip beyond standard deviation. How do I do this? By having a common ID carried from the start to end of a process.
Take our architecture as an example. Everything is a chain of commands and events. Something like this:
If we have a single ID that flows through this chain, it'll be easy for us to monitor (and perform integration testing). This is why it's great to have a common parent for both commands and events. This is typically referred to as a "fact." So everything in our system is a "fact." The JSON might look something like this:
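A hedged sketch of what a fact might contain (the field names here are assumptions):
// Hypothetical "fact" shape; field names are assumptions for illustration.
const fact = {
  id: 'c1f0e7e2-5b1d-4c9a-9e51-0d2f6a8b3c77', // stays constant across the whole chain
  type: 'command',                            // flips between 'command' and 'event'
  subtype: 'UpdateReadings',                  // the specific command or event name
  created: '2018-08-01T12:00:00Z',
  data: { deviceId: 'parking-042', readings: { ir: 0.82, magnetic: 0.11 } },
};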
As we move through our chain of commands and events, we change the fact type and subtype, but never the ID. This means that we can log and monitor the flow of each of our initial inputs as it migrates through the system.
Of course, as with all things monitoring (and integration testing), life isn't so simple. This stuff is hard. You simply cannot perfect your monitoring or integration testing. If you did, you would've solved the Halting Problem. Seriously, if I could give any one piece of advice to aspiring computer scientists, it would be to fully understand the Halting Problem and the CAP theorem.
Pitfalls of serverless
Serverless has no flaws! You think I'm joking. I'm not. The pitfalls in serverless have nothing to do with the services themselves. They all have to do with you. Yep. True serverless systems are extremely powerful and cost-efficient. The only problem: developers have a tendency to use these services wrong. They don't take the time to truly understand the design and rationale of the underlying technologies. Google uses these exact same services to run the world's largest and most performant web applications. Yet, I hear a lot of developers complaining that these services are just "too slow" or "missing a lot."
Frankly, you're wrong. Serverless is not generic. It isn't compute instances that let you install whatever you want. That’s not what we want! Serverless is a compute service that does a specific task. Now, those tasks may seem very generic (like databases or functions), but they're not. Each offering has a specific compute model in mind. Understanding that model is key to getting the maximum value.
So what is the biggest pitfall? Abuse. Serverless lets you do a LOT for very cheap. That means the mistakes are going to come from your design and implementation. With serverless, you have to really embrace the design process (more than ever). Boiling your problem down to its most fundamental elements will let you build a system that doesn't need to be replaced every three to four years. To get where we needed to be, my team rebuilt the entire kernel of our service three times in one year. This may seem like madness, and it was. We were our own worst enemy. We took old (and outdated) notions of software and web services and kept baking them into the new world. We didn't believe in serverless. We didn't embrace data mitosis. We resisted streaming. We didn't put data first. All mistakes. All because we didn't fully understand the intent of our tools.
Now, we have an amazing platform (with a code base reduced by 80%) that will last for a very long time. It's optimized for monitoring and analytics, but we didn't even need to try. By embracing data and design, we got so much for free. It's actually really easy, if you get out of your own way.
Conclusion
As development teams begin to transition to a world of IoT and serverless, they will encounter a unique set of challenges. The goal of this series was to provide an overview of recommended techniques and technologies used by one team to ship an IoT/serverless product. A quick summary of each part is as follows:
Part 1 - Getting the most out of serverless computing requires a cutting-edge approach to software design. With the ability to rapidly prototype and release software, it’s important to form a flexible architecture that can expand at the speed of inspiration. Sounds cheesy, but who doesn’t love cheese? Our team utilized domain-driven design (DDD) and event-driven architecture (EDA) to efficiently define a smart city platform. To implement this platform, we built microservices deployed on serverless compute services.
Biggest takeaway: serverless now makes event-driven architecture and microservices not only a reality, but almost a necessity. Viewing your system as a series of events will allow for resilient design and efficient expansion.
Part 2 - Implementation of an IoT architecture on serverless services is now easier than ever. On Google Cloud Platform (GCP), powerful serverless tools are available for:
IoT fleet management and security -> IoT Core
Data streaming and windowing -> Dataflow
High-throughput data storage -> Bigtable
Easy transaction data storage -> Datastore
Message distribution -> Pub/Sub
Custom compute logic -> Cloud Functions
On-demand analytics of disparate data sources -> BigQuery
Combining these services allows any development team to produce a robust, cost-efficient and extremely performant product. Our team uses all of these and was able to adopt each new service within a single one-week sprint.
Biggest takeaway: DevOps is dead. Serverless systems (with proper non-destructive, deterministic data management and testing) mean that we’re just developers again! No calls at 2am because some server got stuck? Sign me up for serverless!
Part 3 - To be truly serverless, a service must offer a limited set of computational actions. In other words, to be truly auto-scaling and self-maintaining, the service can’t do everything. Understanding the intent and design of the serverless services you use will greatly improve the quality of your code. Take the time to understand the use-cases designed for the service so that you extract the most. Using a serverless offering incorrectly can lead to greatly reduced performance.
For example, Pub/Sub is designed to guarantee rapid, at-least-once delivery. This means messages may arrive multiple times or out-of-order. That may sound scary, but it’s not. Pub/Sub is used by Google to manage distribution of data for their services across the globe. They make it work. So can you. Hint: consider deterministic code. Hint, hint: If order is essential at time of data inception, use windowing (see Dataflow).
Biggest takeaway: Don’t try to use a hammer to clean your windows. Research serverless services and pick the ones that suit your problem best. In other words, not all serverless offerings are created equal. They may offer the same essential API, but the implementation and goals can be vastly different.
Finally, before we part, let me say, “Thank you.” Thanks for following through all my ramblings to reach this point. There's a lot of information, and I hope that it gives you a preview of the journey that lies ahead. We're entering a new era of web development. It's a landscape full of treasure, opportunity, dungeons and dragons. Serverless computing lets us discard the burden of DevOps and return to the adventure of pure coding. Remember when all you did was code (not maintenance calls at 2am)? It's time to get back there. I haven't felt this happy in my coding life in a long time. I want to share that feeling with all of you!
Please, send feedback, requests, and dogs (although, I already have 7). Software development is a never-ending story. Each chapter depends on the last. Serverless is just one more step on our shared quest for holodecks. Yeah, once we have holodecks, this party is over! But until then, code as one.
We had an excellent experience partnering and working with Google to complete the PCI DSS certification on our platform-as-a-service (PaaS) that hosts customized OroCommerce sites for Oro customers. We're proud to partner with Google Cloud to offer our customers a secure environment.
Building PCI DSS compliant infrastructure
At its core, building a PCI DSS compliant infrastructure requires:
The platform used to build your service must be PCI DSS compliant. This is a direct compliance requirement.
Your platform must provide all the tools and methods used to build secure networks.
Google helped with both of these. The first point was easy, since all GCP services are PCI DSS compliant. In addition, Google provided us with a Shared Responsibility document that lists all PCI DSS requirements. This document explains the details of how Google achieves compliance and what Google customers need to do above and beyond that to support a compliant environment. The document not only has legal value; used as a checklist, it can also be a useful tool when going for PCI DSS certification.
For example, Google supports PCI DSS requirement #9, which mandates the physical security of a hosting environment including the need for guards, hard disk drive shredders, surveillance, etc. Hearing that Google takes the responsibility to protect both hardware and data from physical theft or damage was very reassuring. We rely on GCP tools to protect against inappropriate access and ensure day-to-day information security.
Another key requirement of a secure network architecture (and PCI DSS) is to hide all internal nodes from external access, control all incoming and outgoing traffic, and use network segregation for different application tiers. OroCommerce fulfills these requirements by using Google’s Virtual Private Cloud, firewall rules, advanced load balancers and Cloud Identity and Access Management (IAM) for authentication control. Google Site Reliability Engineers (SRE) have secure connections into production nodes inside the isolated production network using Google’s 2-step authentication mechanisms.
Also, we found that we can safely use Google-provided Compute Engine images based on up-to-date and secure Linux distributions. This frees the sysadmin from hardening the OS, so they can pay more attention to vulnerability management and other important tasks.
While the importance of a secure infrastructure, access control, and network configuration is well-known, it’s also important to build and maintain a reliable logging and monitoring system. The PCI DSS standard puts an emphasis on audit trails and logs. To be compliant, you must closely monitor environments for suspicious activity and collect all needed data for a predetermined length of time to investigate any incidents. We found the combination of Stackdriver Monitoring and Logging, plus big data services such as BigQuery, helped us meet our monitoring, storing and log analysis needs. With Stackdriver, we monitor our production systems and detect anomalies in a thorough and timely manner, spending less time on configuration and support. We use BigQuery to analyze our logs so engineers can easily figure out what happened during a particular period of time.
Back in 2017 when we started to work on getting PCI DSS compliance for OroCommerce, we expected to spend a huge amount of time and resources on this process. But as we moved forward, we figured out how much GCP helped us to meet our goal. Having achieved PCI DSS compliance, it’s clear that choosing GCP for our infrastructure was the right decision.