Application of a Multilayered Defense-in-Depth Approach to Security Using Hitachi's Custom 4C Cloud Security Model
One of the most frequently underestimated areas in implementing a cloud platform or application is security. The purpose of this white paper is to bring more focus to the importance of cloud security and to describe our approach as prescribed through Hitachi Vantara's customized 4C Security Model.
Before we get into the specifics of our own view of the 4C model, let's look at what makes a good approach to architecting cloud security. No matter what level of security one administers in an ecosystem, it has to be assumed that it will eventually be breached. One way to implement security, therefore, is to anticipate the breach and adopt a defensive posture: design for potential compromise and protect accordingly.
Note that this document does not cover aspects of compliance; the focus is only on security.
Defense-in-Depth Security Model
One of the industry-standard approaches is the DiD, or Defense-in-Depth, security model, wherein security itself is layered so that even if one layer is compromised, the other layers continue to provide protection, reducing the blast radius of the attack.
Figure 1 below provides our visualization of Defense-in-Depth protection.
Defense-in-Depth is a conceptual model consisting of seven layers of security for maximum isolation and protection. Each layer focuses on a specific responsibility, which is either owned outright or shared with the cloud service provider (CSP).
Hitachi’s 4C Security Model
While the four C's of cloud-native security are not specific to Hitachi, our custom 4C security model is an implementation of the DiD model focused on the aspects of security surrounding:
1. Cloud
2. Cluster
3. Container (Docker and any application container)
4. Code (including both code and data protection)
Follow the color key in Figure 2 below for an overlay of Hitachi’s 4C on the DiD model to visualize their interconnected responsibilities.
From here we can delve into deeper detail.
Cloud Level
We cannot have any discussion of cloud security without touching upon the enterprise landing zone (LZ). The LZ is the foundational architecture for any cloud platform and focuses on the provisioning, configuration, governance and management of cloud resources with respect to security, scale and cost.
Focusing specifically on an Azure cloud LZ, the following are some of the attributes that come into focus from a security standpoint.
- Subscription Strategy & Management Groups
In Azure, subscriptions are the containers for the security of resources. All resources within a subscription can be scoped for security, cost and usage. As subscriptions don't cost anything in Azure, best practice is to leverage a multi-subscription strategy based on application needs. One recommendation is to organize subscriptions by the type of tasks they hold – ideally isolating the platform team's tasks from the application team's, and even segregating the application team's subscriptions by life cycle stage. At minimum, keep separate subscriptions for platform vs. application and production vs. non-production so that security, usage, cost and management can be isolated.
A reference subscription strategy is provided in the next section.
When multiple subscriptions are leveraged in an implementation, an Azure management group is one of the most effective ways to assign and manage roles and policies. Management groups act as containers for organizing subscriptions and the access policies within them.
A reference subscription strategy and management group is shown below where the platform subscriptions are segregated from the application subscriptions. The management group is a hierarchy and the “project 1” prod subscription resources are segregated from their non-prod equivalent.
While best practices for building an Azure LZ can be an article by itself, it is good to have a sandbox subscription in every implementation to foster learning.
- Hub & Spoke Network Isolation Model
It is a good security practice to enable isolation in the network infrastructure. One of the approaches is to isolate the hub from the spoke networks.
The hub is the edge of the network and should be protected with the necessary authentication providers, web application firewall and ingress rules for DDoS prevention, and egress rules to prevent unnecessary traffic leaving the cloud. Each spoke hosts an application workload, further segmented by project or type of workload – prod vs. non-prod – into individually isolated networks and subnets.
- Identity & Access Management
Azure Active Directory is the default identity and access management service in Azure. Every customer is a tenant in Azure Active Directory. Every user, group and application is set up within a tenant and can be granted standard Azure RBAC built-in roles, resource-specific roles, Azure service-specific roles or custom roles that enable access to Azure resources.
Users' access to resources in Azure is enabled by exchanging identity tokens with AD upon successful login; applications set up to trust the identity provider's token enable single sign-on (SSO) for users. The token lifetime can be set, and renewal forced, based on the business impact of the application. Access to the applications themselves is the responsibility of each application.
Authorization of users within the apps is controlled by the application using RBAC roles.
Managed identities are one of the standard ways of using RBAC within Azure. Applications can be configured in Azure with a ClientID and client secret, just like the UserId and password of a user, but they then have the problem of maintaining this configuration in a config file or, if it is managed in a key vault, of needing an additional way to authenticate to the vault. Neither is an elegant solution. Instead, Azure AD provides managed identities that are associated with an application and can be granted access to other resources.
There are two types of managed identities – system-assigned, which has a 1-to-1 correspondence with a resource or application, and user-assigned, which can be shared by multiple resources. User-assigned identities are used when multiple applications require the same type of access to a resource, reducing the overhead of configuring unique identities for each.
- Governance, Auditing and Monitoring
While RBAC sets up the access control rules, governance enforces them. Authored as JSON policies, governance enforces the access controls through monitoring and audits at the provisioning, usage and cost levels. Any exceptions to the policies are prevented and escalated using alerts.
Figure 5 below illustrates how a reference Azure enterprise LZ is overlayed with DiD security.
Cluster Level
These are some of the best practices followed for securing the AKS Cluster.
All operations in a K8s cluster are performed through the REST APIs provided by the API server, which is part of the K8s control plane. By default, the AKS control plane's API server is created with a public IP and public FQDN, which opens up attack vectors requiring protection. While the LZ hub-and-spoke network isolation model ensures some level of protection at the cloud level, the following best practices should be followed to protect the cluster.
1) Isolate API Server – There are 2 options here:
Restrict traffic to the API server to known IP ranges that are specific to the worker node pool and any other known traffic sources. This is the recommended approach.
Place the API server on a private network. In this scenario, the control plane and data plane end up in separate VNets, requiring VNet peering or a private link underlay to enable communication. This option is a bit more complex.
2) Enable Kubernetes Security – Security in K8s includes AuthN/AuthZ of all the requests to the K8s API. There are two types of users, or subjects, that K8s security has to handle.
Users – human users such as developers and admins who run kubectl commands for different operations. Users are managed outside of K8s and have cluster-level scope.
While there are multiple authentication plugins in K8s, best practice is to use an auth token provided by an identity provider like AD upon successful login, and use that token for identity and, via groups, for authorization with K8s.
ServiceAccounts – users that act on behalf of the applications running inside a pod when requesting services from the API server. ServiceAccounts have namespace-level scope, tied to the namespace where the pod is running.
They can be created manually, with a secret (JWT token) volume mounted into the container where the application runs. Alternatively, K8s can associate the pod automatically with the default ServiceAccount of the namespace.
Best practice is to create ServiceAccounts manually, as relying on automatic association requires pods to have additional permissions to read namespace metadata. Also, from K8s version 1.24, ServiceAccount token secrets are no longer generated automatically. Both users and ServiceAccounts can be associated with groups; leverage groups for better management of RBAC. A minimal sketch of an explicitly created ServiceAccount is shown below.
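The following is a minimal sketch (the names orders-app and orders-prod and the image reference are hypothetical) showing a ServiceAccount created explicitly and referenced from a pod spec, with token auto-mounting disabled by default and enabled only for the workload that actually needs API server access:

```yaml
# Hypothetical ServiceAccount created explicitly for an application.
apiVersion: v1
kind: ServiceAccount
metadata:
  name: orders-app            # hypothetical application identity
  namespace: orders-prod      # hypothetical namespace
automountServiceAccountToken: false   # do not mount the token unless a pod opts in
---
# Pod fragment referencing the ServiceAccount; the token is mounted only here.
apiVersion: v1
kind: Pod
metadata:
  name: orders-api
  namespace: orders-prod
spec:
  serviceAccountName: orders-app
  automountServiceAccountToken: true  # this workload calls the K8s API
  containers:
  - name: orders-api
    image: myregistry.azurecr.io/orders-api:1.0.0   # hypothetical image
```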
3) Authorization Plugin – Authorization in Kubernetes is through RBAC. RBAC is offered in Kubernetes through Roles and ClusterRoles, for namespace and cluster scope respectively. RoleBinding and ClusterRoleBinding objects are leveraged for binding the roles to users, groups or ServiceAccounts.
Best practice is to manage the RBAC policies in an identity manager like AAD and have them referenced by K8s. A namespace-scoped example is sketched below.
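As an illustration, assuming an AAD-integrated cluster, the following hypothetical Role grants read-only access in a single namespace and is bound to an AAD group by its object ID (all names and the group ID placeholder are assumptions):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: app-reader
  namespace: orders-prod
rules:
- apiGroups: ["", "apps"]
  resources: ["pods", "deployments"]
  verbs: ["get", "list", "watch"]      # read-only verbs
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: app-reader-binding
  namespace: orders-prod
subjects:
- kind: Group
  name: "<aad-group-object-id>"        # placeholder AAD group object ID
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: app-reader
  apiGroup: rbac.authorization.k8s.io
```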
4) Managing Network Traffic Into/Within Clusters – Every data plane has to be isolated by a virtual network for layered security, and any ingress and egress traffic into and out of the cluster has to be protected by the necessary ingress and egress policies. This includes configuring edge router rules and K8s network policies; having the necessary subnets and network rules provides additional protection. One of the key decisions is the choice of container network interface (CNI) plugin – the plugin that provides IP address management (IPAM) and the addressing scheme. Azure supports two network interface plugins by default – Kubenet and Azure CNI. For applications that require egress network policies, the Azure CNI plugin is required. However, Azure CNI assigns pod IPs from the subnet, resulting in a flat addressing scheme that consumes a large number of IP addresses.
While there isn't a single best practice and the design depends on the problem being solved, for stringent network policies use Azure CNI, understanding that roughly 70% fewer IP addresses will be available compared to Kubenet due to its flat network addressing. If you want advanced network isolation, such as per-application firewalls, consider third-party plugins like Calico, Juniper or Weave CNI, which are also supported by Azure. A minimal network policy sketch follows.
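As a sketch of the kind of K8s network policy referred to above (the namespace, labels and port are hypothetical), the first policy denies all ingress within a namespace by default and the second re-opens traffic only from an assumed ingress controller namespace:

```yaml
# Default-deny all ingress traffic to pods in the namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: orders-prod
spec:
  podSelector: {}
  policyTypes: ["Ingress"]
---
# Allow traffic to the application only from the ingress controller namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-from-ingress
  namespace: orders-prod
spec:
  podSelector:
    matchLabels:
      app: orders-api
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: ingress   # assumed ingress namespace name
    ports:
    - protocol: TCP
      port: 8080
```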
5) Managing Service2Service Traffic – We recommend leveraging a service mesh for this purpose. AKS supports Linkerd.
A service mesh provides:
* An intelligent proxy for routing and retries in the event of failures and circuit breaking.
* Better visibility and tracing at the call level.
* Above all, better security enabling mTLS up to the container level in a pod.
We recommend leveraging a service mesh proxy in the solution as it removes a lot of application code for the capabilities above. Also, if there is a need for mTLS up to the container, do not hand-roll certificate code; instead leverage the service mesh proxy, as sketched below.
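As one possible sketch, assuming Linkerd is installed in the cluster, annotating a namespace opts its pods into sidecar proxy injection, and traffic between meshed pods is then mutually TLS-encrypted by the proxies (the namespace name is hypothetical):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: orders-prod              # hypothetical namespace
  annotations:
    linkerd.io/inject: enabled   # inject the Linkerd proxy into pods created here
```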
6) Host-Level Security Enhancements Using MAC – Sometimes applications have hardened security requirements. Linux by default provides only DAC (discretionary access control), not mandatory access control (MAC). Leverage SELinux or AppArmor profiles at the host level to improve overall security.
Adopt MAC only in cases of absolute necessity, as it adds significant effort and performance overhead. An AppArmor sketch is shown below.
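As a minimal sketch, assuming AppArmor is enabled on the AKS nodes, the annotation form (used on K8s versions prior to the dedicated securityContext field) applies the runtime's default profile to a named container; the pod, container and image names are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened-app
  annotations:
    # Apply the container runtime's default AppArmor profile to the "app" container.
    container.apparmor.security.beta.kubernetes.io/app: runtime/default
spec:
  containers:
  - name: app
    image: myregistry.azurecr.io/app:1.0.0   # hypothetical image
```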
7) Set Limits and Quotas at the Namespace Level – It is a good practice to set limits and quotas at the namespace level, as this controls the blast radius in the event of failures due to vulnerabilities. These are some additional best practices.
* ResourceQuota defines the combined requests and limits for CPU, memory, pods, services, PVCs, replicas, etc. within a namespace. Good practice is to create it at least for CPU and memory.
* Limit range good practice is to define the min, default and max values for limits and requests so that you can enforce optimal microservices design.
Mature tech organizations control their microservices design through T-shirt-sized limits and quotas: min for small, default for medium and max for large microservices. A sketch of both objects follows.
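The following is a minimal sketch of a namespace-level ResourceQuota and LimitRange; the namespace name and all CPU/memory figures are illustrative assumptions, not recommendations:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: orders-quota
  namespace: orders-prod
spec:
  hard:
    requests.cpu: "8"        # total CPU requests allowed in the namespace
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "50"
---
apiVersion: v1
kind: LimitRange
metadata:
  name: orders-limits
  namespace: orders-prod
spec:
  limits:
  - type: Container
    min:            { cpu: 100m, memory: 128Mi }   # "small"
    defaultRequest: { cpu: 250m, memory: 256Mi }   # request applied when none is set
    default:        { cpu: 500m, memory: 512Mi }   # limit applied when none is set
    max:            { cpu: "1",  memory: 1Gi }     # "large"
```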
8) Isolate Production and Non-Production Clusters – An AKS cluster is created within a VNet and inside a resource group (RG) within Azure. Embrace the multi-subscription strategy described in the previous section and ensure that prod and non-prod clusters are created in different Azure subscriptions for maximum isolation.
Isolate sensitive workloads to separate node pools within a cluster using taints and tolerations (see the sketch below). Also isolate access using separate ingress rules.
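As an illustrative sketch, assuming a dedicated AKS node pool has been created with a taint such as workload=sensitive:NoSchedule, a sensitive pod can tolerate that taint and pin itself to the pool (the pool, label and image names are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: payments-api
spec:
  nodeSelector:
    agentpool: sensitivepool          # hypothetical dedicated node pool
  tolerations:
  - key: "workload"                   # matches the assumed node pool taint
    operator: "Equal"
    value: "sensitive"
    effect: "NoSchedule"
  containers:
  - name: payments-api
    image: myregistry.azurecr.io/payments:1.0.0   # hypothetical image
```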
9) Segregate Namespaces for Applications and Teams – Namespaces are the virtual partitions of K8s clusters and serve as a boundary for the applications or teams from a security and access management standpoint.
* Never launch applications within the default namespace.
* Namespaces, ResourceQuota and limit ranges limit the attack surface area and the degree of an attack; use them.
* Leverage pod disruption budgets (PDB) to ensure that critical workloads are not completely drained in the event of node maintenance; a sketch follows.
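A minimal PDB sketch, with hypothetical names and a label selector assumed to match the critical workload's pods:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders-api-pdb
  namespace: orders-prod
spec:
  minAvailable: 2           # keep at least two pods running during voluntary disruptions
  selector:
    matchLabels:
      app: orders-api
```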
10) Forced Compliance Using OPA (Open Policy Agent) Add-Ons – While teams can be trained on best practices and security processes can be reviewed for compliance, nothing can replace some level of forced compliance. This ensures automated processes are in place to watch for non-compliance, preventing violations and flagging them as alerts.
OPA Gatekeeper is a K8s admission controller add-on that enables this. By creating policies and associating them with the admission controller, compliance can be enforced automatically, as in the sketch below.
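For example, assuming Gatekeeper and the K8sRequiredLabels constraint template from its policy library are installed, a constraint such as the following (with a hypothetical label requirement) rejects any namespace created without an owner label:

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels              # assumes the library ConstraintTemplate is installed
metadata:
  name: ns-must-have-owner
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Namespace"]
  parameters:
    labels: ["owner"]                # hypothetical mandatory label
```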
11) Increase Cluster Level Auditability
* Increase observability through effective logging.
* Logging includes MELT – metrics, events, logs and traces.
* Consider logs as event streams which are streamed from every node and managed in a centralized store.
Container Level
These are some of the best practices followed for container-level security.
1. Pod as a Single Responsibility – A pod's lifecycle depends on the lifecycle of the containers within it.
* Ideally keep one container per pod so that there is only one reason for a pod to be recycled.
* Having said that, if a group of containers share the same life cycle, it is OK to keep them in the same pod so that they can be launched and deleted together.
* A good example is the sidecar container. Either way, keep only one image per container.
2. Prefer Smaller Container Images – Smaller container images provide a lot of benefits: they are easier to pull from image repos and they perform better. The following are some good practices for smaller Docker container images.
* Create multi-stage Docker builds whereby the build process is separated into a different stage from the packaging stage.
* Separate application classes from dependencies to take advantage of Docker layer caching.
* Alternatively, use daemonless container builds for Java using Jib, which is added as a Maven plugin in the POM file. Jib optimizes build caching by breaking different parts of the build into different layers, all without using a Dockerfile.
3. Prevent Using Poisoned Images – Ensure all images are vulnerability scanned before use.
* Ensure usage of lightweight base OS images from well-known repositories, and add only the necessary dependencies and tools (for example, include a shell or interactive debugging tools only when genuinely needed).
* Use image vulnerability scanning tools like Twistlock, Nautilus, Sysdig, Snyk, etc., and run the scans as part of the CI/CD pipeline.
* Only deploy signed images from well-known distro sources. Use open-source add-ons like IBM Portieris policies with the K8s admission controller to enforce image security and trust policies before an image is allowed into the cluster for deployment.
* Do not keep secrets in the images. With Azure, one of the better practices is to manage secrets – JWT tokens, certificates or keys – in an Azure Key Vault and have them accessed by the AKS cluster. There are a few nuances to this:
A. An AKS cluster cannot access Azure Key Vault keys directly.
B. It needs an add-on such as the Secrets Store CSI (container storage interface) driver for this.
C. Any secret inside the Key Vault also requires a managed identity, set up within Azure, for its access.
D. In our case we could use the managed identity of the AKS VM scale set that hosts the pod where the application runs, configured for access to the secret resource. This requires a SecretProviderClass in which the Key Vault location and the type of object are configured.
E. Finally, the SecretProviderClass is mapped into the pod in the AKS cluster by creating a volume mount of the SecretProviderClass, which mounts the secret into the pod. A sketch of these pieces follows.
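The following sketch illustrates steps B–E above, assuming the Azure Key Vault provider for the Secrets Store CSI driver is enabled on the cluster; every name, identity ID and placeholder value is a hypothetical example:

```yaml
apiVersion: secrets-store.csi.x-k8s.io/v1
kind: SecretProviderClass
metadata:
  name: app-kv-secrets
  namespace: orders-prod
spec:
  provider: azure
  parameters:
    useVMManagedIdentity: "true"                            # use the node pool's managed identity
    userAssignedIdentityID: "<kubelet-identity-client-id>"  # placeholder
    keyvaultName: "<key-vault-name>"                        # placeholder
    tenantId: "<tenant-id>"                                 # placeholder
    objects: |
      array:
        - |
          objectName: db-password                           # hypothetical secret in the vault
          objectType: secret
---
# Pod fragment mounting the secret through the CSI driver volume.
apiVersion: v1
kind: Pod
metadata:
  name: orders-api
  namespace: orders-prod
spec:
  containers:
  - name: orders-api
    image: myregistry.azurecr.io/orders-api:1.0.0           # hypothetical image
    volumeMounts:
    - name: kv-secrets
      mountPath: /mnt/secrets
      readOnly: true
  volumes:
  - name: kv-secrets
    csi:
      driver: secrets-store.csi.k8s.io
      readOnly: true
      volumeAttributes:
        secretProviderClass: app-kv-secrets
```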
4. Reduce Attack Surface – Embrace a least-privilege access model by default. Ensure the pod security context is set as follows.
* Ensure that the pod spec of the workloads does not have hostPID or hostIPC set to true. If set to true, the pod can use the process ID namespace and IPC namespace of the host, exposing host processes to attack.
* Ensure that the pod securityContext spec sets runAsNonRoot: true. This setting ensures that the containers in the pod are not allowed to run as root users.
* Ensure the container spec is configured with allowPrivilegeEscalation: false. This setting ensures a child process cannot gain more privileges than its parent.
* Delete any privileged: true setting in the securityContext spec of the container to prevent running the pod in privileged mode.
* Ensure no capabilities attributes are set to SYS_ADMIN or NET_ADMIN, which could enable malicious processes in containers to intrude into other applications or the host.
* Ensure readOnlyRootFilesystem: true is set in the container spec so that the root file system mounted into the container is read-only.
A consolidated sketch of these settings is shown below.
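The following is a minimal sketch combining the settings above in one pod spec (the names, image and UID are hypothetical):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: least-privilege-app
spec:
  hostPID: false                      # do not share the host's process namespace
  hostIPC: false                      # do not share the host's IPC namespace
  securityContext:
    runAsNonRoot: true
    runAsUser: 10001                  # hypothetical non-root UID
  containers:
  - name: app
    image: myregistry.azurecr.io/app:1.0.0   # hypothetical image
    securityContext:
      allowPrivilegeEscalation: false
      privileged: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]                 # drop all capabilities, including SYS_ADMIN/NET_ADMIN
```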
5. Limit Blast Radius – Enforce requests and limits at the container level to secure the cluster from poisoned images:
* The more CPU and memory available, the greater the ability of rogue code to cause damage.
* Restrict the container resources to less than one CPU core (1000m) so that it can be scheduled confidently on any node.
* Good practice is to set the requests and limits 25% above the CPU and memory utilization computed during capacity planning. A fragment is sketched below.
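A minimal container fragment showing requests and limits; the figures and image are illustrative assumptions only:

```yaml
  containers:
  - name: orders-api
    image: myregistry.azurecr.io/orders-api:1.0.0   # hypothetical image
    resources:
      requests:
        cpu: 250m
        memory: 256Mi
      limits:
        cpu: 500m          # below one core (1000m), so schedulable on any node
        memory: 512Mi
```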
6. Configure Health Check Probes for Pods
* Set up a readiness probe to validate that the pod is ready to take traffic upon startup.
* Set up a liveness probe to check that the pod is still alive and can continue to take traffic. A sketch of both probes follows.
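A minimal sketch of both probes on a container; the paths, port and timings are hypothetical:

```yaml
  containers:
  - name: orders-api
    image: myregistry.azurecr.io/orders-api:1.0.0   # hypothetical image
    readinessProbe:                  # gate traffic until the app reports ready
      httpGet:
        path: /healthz/ready
        port: 8080
      initialDelaySeconds: 5
      periodSeconds: 10
    livenessProbe:                   # restart the container if it stops responding
      httpGet:
        path: /healthz/live
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 20
```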
7. Choose the Appropriate K8s Controller for Workload Invocation
* For stateless pods, leverage the Deployment kind, which provides PaaS-like capabilities to the pods and enables autoscaling.
* For a pod on every node within the cluster, use the DaemonSet kind.
* For stateful pods like a DB or message broker, leverage the StatefulSet kind.
* Use an init container to prepare the right environment for a stateful or stateless pod; a minimal sketch follows.
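A minimal Deployment sketch with an init container that waits for a dependency before the application starts; the names, image and dependency host/port are hypothetical:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      initContainers:
      - name: wait-for-db            # prepares the environment before the app starts
        image: busybox:1.36
        command: ["sh", "-c", "until nc -z orders-db 5432; do sleep 2; done"]
      containers:
      - name: orders-api
        image: myregistry.azurecr.io/orders-api:1.0.0   # hypothetical image
```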
8. Leverage Pod Autoscaler, PDB and QoS for Execution Guarantees – While it is a good practice to capacity plan any workload, also follow the guidance below:
* Take advantage of the horizontal pod autoscaler (HPA) and KEDA (K8s event-driven autoscaler) to get PaaS-like capabilities and rules-based scaling.
* Apply HPA for a group of nodes that have the same capacity.
* Apply a pod disruption budget (PDB) for application and system pods to ensure availability of critical workloads (see the PDB sketch in the cluster section); an HPA sketch follows.
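A minimal HPA sketch targeting a Deployment; the target name, replica range and utilization threshold are illustrative assumptions:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: orders-api               # hypothetical Deployment to scale
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70     # scale out above 70% average CPU utilization
```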
9. Pod Execution QoS Guarantees in the Event of Resource Starvation
* For pods considered top-priority, make sure CPU and memory requests and limits are set to guarantee execution. By setting the limits 25% above the planned capacity, these pods can run without disruption.
* For pods considered lower-priority, such as API GW processes that are already spread across hundreds of nodes, set no CPU and memory requests or limits to create best-effort QoS. These will continue to take more resources and, when the node is starved, will be evicted first.
* For generic workloads with some priority, set them up as burstable, with either requests or limits configured.
Default pod behavior: a K8s pod does not get killed but is throttled if it exceeds its CPU allocation. However, if it exceeds the available memory, it is killed.
10. Container Deployment Strategies
* While rolling updates are the native and default approach in K8s, a better approach is a blue/green (B/G) deployment, which also guarantees a zero-downtime upgrade with rollback capabilities. B/G deployment guarantees more pod availability than rolling updates.
* Canary deployment is an extension to the B/G deployment. Versions are released for a subset of users. Error rates and performance can be monitored before deploying completely.
* AKS supports both canary and B/G deployment. We recommend canary as it has the advantages of B/G while also providing small, staged releases. A basic canary sketch follows.
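As one simple way to realize a canary with plain K8s objects (all names, labels and the 9:1 split are hypothetical assumptions), a stable and a canary Deployment can share the label selected by the Service, so the canary receives roughly its replica-proportional share of traffic:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: orders-api
spec:
  selector:
    app: orders-api                    # selects both stable and canary pods
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api-stable
spec:
  replicas: 9                          # ~90% of traffic
  selector:
    matchLabels: { app: orders-api, track: stable }
  template:
    metadata:
      labels: { app: orders-api, track: stable }
    spec:
      containers:
      - name: orders-api
        image: myregistry.azurecr.io/orders-api:1.0.0   # current version (hypothetical)
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api-canary
spec:
  replicas: 1                          # ~10% of traffic for the new version
  selector:
    matchLabels: { app: orders-api, track: canary }
  template:
    metadata:
      labels: { app: orders-api, track: canary }
    spec:
      containers:
      - name: orders-api
        image: myregistry.azurecr.io/orders-api:1.1.0   # candidate version (hypothetical)
```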
Code Level
Code level includes both code and data protection.
1. Code Must be Enabled for AuthN and AuthZ Using SSO
We have discussed this level of application security in the cluster section under ServiceAccount and RBAC roles.
2. Data Protection at Rest
Azure storage accounts provide SSE (server-side encryption), which protects data at rest.
3. Data Protection in Motion
Use client-side encryption with certificates (mTLS). We discussed this at length in the service mesh portion of the cluster level.
4. Instrumentation of the Code Following Open Telemetry Standard for Observability
We instrument MELT – metrics, events, logs and traces. There are several techniques for extracting these from containers and clusters, such as StatsD, CollectD or cAdvisor agents. Azure Container Insights is the platform responsible for this. A view of the same is provided below.
Conclusion
While comprehensive, this document is by no means the complete guide to cloud security, but it attempts to stay close to the truth and serve as a standard for security within the cloud and Kubernetes space. By taking a highly defensive, multilayered approach to the established 4C security concept, we believe companies can be proactively prepared to prevent and, when necessary, minimize and mitigate whatever kind of threat comes next.
This document will be updated as new vulnerabilities and fixes are identified and created. Please reach out to Chandra Ganapathy, Chandra.Ganapathy@Hitachivantara.com, with any questions or comments or if you would like to have your cloud security initiative validated.
Find Out More ->
Learn more about Hitachi Cloud Security Services, which are offered as part of Hitachi Application Reliability Centers (HARC) services.
#CloudSecurity