Securely Moving Data from External AWS Accounts into Customer Environments

By Shrey Khokhawat

  

Introduction/Context 

Establishing an efficient data transfer mechanism between two distinct enterprises that both hold data on AWS is a complex process. The transfer encompasses substantial volumes of historical data as well as ongoing incremental data. The primary focus remains on guaranteeing that the ingested data adheres to the security and privacy regulations of the participating organizations, with no modifications made to its content. Cost and performance are further areas of concern.

While there are many native AWS solutions that can be leveraged, several factors may make it impossible to use such out-of-the-box options. For instance:

  1. Whitelisted services permitted by the organizations
  2. Cross-Account Permissions
  3. Data Security Measures
  4. Compliance with Data Ownership and Regulations
  5. Network Security standards and policies
  6. Data Integrity Checks
  7. Monitoring and Logging Protocols
  8. Cost effectiveness

This article explains an architectural pattern and techniques to transfer data between AWS accounts of different organizations. 

Existing AWS native services for data transfer

These AWS data transfer services provide organizations with a range of options for securely and efficiently transferring data between AWS accounts owned by different organizations, addressing various use cases, requirements, and constraints. Depending on the specific needs and constraints of the organizations involved, one or more of these services may be suitable for facilitating cross-organizational data transfers on AWS.

For each solution or service below, a brief description is followed by the problems associated with using it for cross-organization transfers.

AWS DataSync

AWS DataSync is a managed data transfer service that facilitates moving large amounts of data between AWS services, on-premises storage, and Amazon S3. It automates data transfer tasks, handles network optimizations, and ensures data integrity during transfer.

DataSync relies on network connectivity between the DataSync agent deployed in the source environment and the AWS services in both the source and destination AWS accounts. Organizations therefore need to ensure that there is adequate network connectivity and that firewall rules allow communication between the environments.

Additionally, it may have limitations with cross-account transfers and may require additional setup for inter-organization transfers.

AWS Transfer Family

AWS Transfer Family offers fully managed file transfer services that allow organizations to securely transfer files over the SSH File Transfer Protocol (SFTP), FTPS (FTP over SSL/TLS), and other protocols directly into and out of Amazon S3 or Amazon EFS.

Securing network communications with AWS Transfer Family services is vital for data protection. Use secure protocols (e.g., SFTP, FTPS) and implement network security measures. Note potential costs and limited support for advanced protocols. Many organizations restrict FTP read/write due to security concerns.

Amazon S3 Cross-Region Replication

Amazon S3 Cross-Region Replication (CRR) enables automatic and asynchronous data replication across different AWS regions. Organizations configure replication policies to copy objects from a source bucket in one AWS account to a destination bucket in another. 

Replicating data between AWS accounts owned by different organizations carries risks such as data loss, corruption, or inconsistencies. Consequently, organizations may restrict cross-account replication due to concerns about data governance, security, ownership, risk management, and legal compliance.

Additionally, there are configuration complexity and potential issues with maintaining data consistency and compliance across replicated buckets.
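To give a sense of the configuration involved, a minimal cross-account replication rule set via boto3 might look like the following sketch. The bucket names, account IDs, and replication role ARN are placeholders, and versioning must already be enabled on both buckets.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical names: replace with the real source bucket, destination
# bucket ARN, destination account ID, and replication role ARN.
s3.put_bucket_replication(
    Bucket="source-org-bucket",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::111111111111:role/replication-role",
        "Rules": [
            {
                "ID": "cross-account-crr",
                "Status": "Enabled",
                "Priority": 1,
                "Filter": {"Prefix": "shared/"},
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    "Bucket": "arn:aws:s3:::destination-org-bucket",
                    "Account": "222222222222",
                    # Hand object ownership to the destination account.
                    "AccessControlTranslation": {"Owner": "Destination"},
                },
            }
        ],
    },
)
```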

AWS Direct Connect

AWS Direct Connect establishes a dedicated network connection between an organization's data center or colocation facility and AWS. It provides a private and secure connection, bypassing the public internet for data transfer.

It can also be used to establish private connectivity between VPCs in different AWS accounts owned by different organizations. 

Requires upfront investment in networking infrastructure. May have limited availability in certain regions or require additional setup for cross-account transfers.

AWS Snowball

AWS Snowball is a physical data transport solution that allows organizations to transfer large amounts of data offline to and from the AWS cloud. It provides rugged storage devices that are shipped to the customer's location for data transfer.

It is particularly well-suited for transferring large volumes of data, such as backups, archives, media files, scientific data, or machine-generated data, to and from AWS, especially in situations where transferring data over the internet is impractical or cost-prohibitive.

Limited scalability and longer transfer times compared to online transfer methods. Requires manual handling and logistics for device shipping.

It is suitable for a one-time historical load, but for incremental workloads organizations still need to rely on other solutions. Another downside of AWS Snowball is that it brings the source data along as-is, including unwanted or restricted content, with no opportunity to filter it in transit.

Run parallel uploads using the AWS CLI

Running parallel uploads using the AWS CLI allows for faster data transfer by concurrently uploading multiple files to Amazon S3.

May require scripting or automation for large-scale uploads. Limited visibility and control compared to managed services. 

Robust error handling mechanisms and proactive monitoring are crucial for managing errors, failures, and interruptions, ensuring data integrity throughout the transfer process. 

Optimizing performance and managing costs during parallel uploads involves careful configuration and monitoring, including experimentation with parameters like concurrency and buffer size while considering factors such as data compression to minimize expenses.
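For illustration, the same concurrency and buffer-size knobs that the CLI exposes through its S3 configuration (for example max_concurrent_requests and multipart_chunksize) can also be tuned programmatically with boto3's transfer manager. The sketch below uses hypothetical bucket, key, and file names, and the tuning values are illustrative only.

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Tune concurrency and buffer size: 10 parallel threads and
# 64 MB multipart chunks (values are illustrative only).
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=10,
    use_threads=True,
)

# Hypothetical local file and destination bucket/key.
s3.upload_file(
    Filename="/data/export/part-0001.parquet",
    Bucket="destination-org-bucket",
    Key="ingest/part-0001.parquet",
    Config=config,
)
```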

Use an AWS SDK

Using an AWS SDK enables custom application development for data transfer between AWS accounts, offering flexibility and control over transfer operations.

Requires development effort for custom integration. May lack features compared to managed services.

Developing a custom application for data transfer between AWS accounts owned by different organizations using the AWS SDK poses several challenges. Security risks inherent in custom application development, including vulnerabilities in code and potential exposure of sensitive data, necessitate robust security measures such as encryption and access controls to ensure data integrity and confidentiality. Additionally, managing errors, retries, and failures gracefully is essential for maintaining the reliability of data transfer operations. 
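As a minimal sketch of such a custom transfer, the snippet below copies a single object across accounts and performs a basic size comparison as an integrity check. It assumes the executing role already has read access to the source bucket and write access to the destination bucket (for example via bucket policies); the bucket and key names are hypothetical.

```python
import boto3
from botocore.exceptions import ClientError

# Assumes the executing role has s3:GetObject on the source bucket and
# s3:PutObject on the destination bucket (e.g. granted via bucket policies).
s3 = boto3.client("s3")

def copy_with_verification(src_bucket, dest_bucket, key):
    """Copy one object across accounts and compare sizes as a basic integrity check."""
    try:
        s3.copy(
            CopySource={"Bucket": src_bucket, "Key": key},
            Bucket=dest_bucket,
            Key=key,
        )
        src_size = s3.head_object(Bucket=src_bucket, Key=key)["ContentLength"]
        dst_size = s3.head_object(Bucket=dest_bucket, Key=key)["ContentLength"]
        if src_size != dst_size:
            raise ValueError(f"Size mismatch for {key}: {src_size} != {dst_size}")
    except ClientError as err:
        # Surface the failure so a retry or quarantine decision can be made upstream.
        print(f"Copy failed for {key}: {err}")
        raise

copy_with_verification("source-org-bucket", "destination-org-bucket", "ingest/part-0001.parquet")
```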

Use Amazon S3 Batch Operations

Using Amazon S3 Batch Operations for data transfer between two AWS accounts owned by different organizations can provide a streamlined approach. With S3 Batch Operations, organizations can efficiently execute large-scale data transfer tasks, such as updating metadata or copying objects from a source bucket in one AWS account to a destination bucket in another.

Limited to specific S3 management tasks and may not cover all data transfer scenarios. Security considerations involve managing access permissions and ensuring data integrity during batch operations.

Tracking the progress of batch operations and monitoring data transfer activities is critical. Implementing robust monitoring and auditing mechanisms helps detect and address any issues or discrepancies that may arise during the transfer process.
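A sketch of how such a copy job might be created and then tracked with boto3's S3 Control API is shown below; the account ID, role ARN, bucket ARNs, and manifest ETag are placeholders.

```python
import uuid
import boto3

s3control = boto3.client("s3control")

# All ARNs, the ETag, and the account ID below are placeholders.
response = s3control.create_job(
    AccountId="111111111111",
    ConfirmationRequired=False,
    Priority=10,
    RoleArn="arn:aws:iam::111111111111:role/s3-batch-copy-role",
    ClientRequestToken=str(uuid.uuid4()),
    Operation={
        "S3PutObjectCopy": {"TargetResource": "arn:aws:s3:::destination-org-bucket"}
    },
    Manifest={
        "Spec": {
            "Format": "S3BatchOperations_CSV_20180820",
            "Fields": ["Bucket", "Key"],
        },
        "Location": {
            "ObjectArn": "arn:aws:s3:::source-org-bucket/manifests/copy-manifest.csv",
            "ETag": "example-etag-of-manifest-object",
        },
    },
    Report={
        "Bucket": "arn:aws:s3:::source-org-bucket",
        "Prefix": "batch-reports",
        "Format": "Report_CSV_20180820",
        "Enabled": True,
        "ReportScope": "AllTasks",
    },
)

# Track the progress of the batch job (see the monitoring note above).
status = s3control.describe_job(AccountId="111111111111", JobId=response["JobId"])
print(status["Job"]["Status"])
```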

Use S3DistCp with Amazon EMR

S3DistCp, a distributed data transfer tool tailored for Amazon EMR, provides a robust solution for transferring data between AWS accounts. It efficiently copies large datasets across Amazon S3 buckets, harnessing EMR's processing capabilities to parallelize tasks, optimize network bandwidth, and handle significant data volumes effectively.

Requires setup and configuration of EMR clusters. May involve additional costs for EMR usage. Security considerations include managing access to EMR clusters.
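As an illustration, an S3DistCp copy can be submitted as a step to an existing EMR cluster; the cluster ID, buckets, and file pattern below are placeholders, and the cluster's instance profile must be able to read the source and write to the destination.

```python
import boto3

emr = boto3.client("emr")

# Cluster ID, buckets, and prefixes are placeholders.
emr.add_job_flow_steps(
    JobFlowId="j-EXAMPLECLUSTERID",
    Steps=[
        {
            "Name": "s3distcp-cross-account-copy",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "s3-dist-cp",
                    "--src", "s3://source-org-bucket/export/",
                    "--dest", "s3://destination-org-bucket/ingest/",
                    "--srcPattern", ".*\\.parquet",
                ],
            },
        }
    ],
)
```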

Third-party Solutions

Various third-party tools and services are available for data transfer between AWS accounts, offering features such as enhanced security, optimization, and management capabilities.

May involve additional costs for licensing or subscription fees. Compatibility and support may vary across different solutions.

 

Concerns, complexities and challenges

The following section outlines various concerns related to the data transfer process, the stakeholders who typically raise them, and typical approaches to address them. These concerns range from data security and privacy to performance and throughput. Each concern is accompanied by recommendations put forward by the relevant teams or individuals to mitigate the associated risks and ensure smooth data transfer operations.

Data Security and Privacy (stakeholders: CSO team)

  • Data from external sources might bring in malware or malicious code that can infect client systems and even enable attackers to gain access and control.

Typical approach: Before ingesting data into the client environment, it is essential to scan it thoroughly for security purposes. This scanning should be conducted by the consumption team to increase confidence in the data's integrity. While AWS offers services such as IAM, GuardDuty, Inspector, Macie, and AWS WAF to address security concerns effectively, AWS Glue and Lambda can be used in their absence to check for specific characters or text within the data, adding an additional layer of security (a minimal sketch of such a check appears after this list of concerns).

Data Consistency and Integrity (stakeholders: CDO, data regulator, and requestor)

  • Ingested data will be used by decisioning systems; incomplete data can cause inconsistencies, and gaps in the data can lead to bias in statistical analysis.
  • Any data brought in must meet all data quality dimensions.
  • High data volumes must be considered for both the historical load and the incremental process, with selective copy for BAU transfers and the historical load.

Typical approach: Any data brought into the target environment should be put through completeness and accuracy checks, and data uniqueness needs to be ensured. Implement validation checks and error-handling mechanisms to maintain data integrity throughout the transfer process.

Compliance with Regulatory Requirements (stakeholders: CSO and legal teams)

Typical approach: Verify that data transfer activities comply with relevant regulations, industry standards, and contractual agreements.

Performance and Throughput (stakeholders: application team)

  • Achieving optimal performance and throughput during data transfer is important to meet the underlying SLAs and OLAs.
  • Latency, bandwidth, and service limitations on both sides can impact transfer speed and efficiency.

Typical approach: Parallelize the data transfer, ensure scalability, and monitor and optimize continuously.

Monitoring and Troubleshooting (stakeholders: application team)

  • Comprehensive monitoring, logging, and alerting mechanisms are needed at each stage to help diagnose and troubleshoot transfer-related issues effectively.
  • Email notifications from SNS may be unavailable in the client environment.

Typical approach: Establish query-based monitoring to track the transfer status; troubleshooting can be done through CloudWatch logs.

Restrictions in the Target Environment (stakeholders: CTO and application development teams)

  • Vanilla AWS services such as Amazon S3 Batch Operations may be unavailable for such data transfers; the client uses its own whitelisting mechanism to make services available for general use.
  • Restrictions on VPC usage within the client environment.
  • Restrictions on using roles interchangeably within the client environment.

Typical approach: Build a custom solution from the available whitelisted services and explore the possibility of fanning out the data copy. Identify non-functional requirements and work closely with CTO teams to enable the functionality required between the available services.
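The following is a minimal sketch of the kind of character/text check described under Data Security and Privacy above, written as a Lambda-style handler. The deny-list patterns, the event shape (bucket and key passed in the payload), and the bucket names are assumptions for illustration; a production check would also stream large objects and add antivirus scanning.

```python
import re
import boto3

s3 = boto3.client("s3")

# Hypothetical deny-list of patterns; the real list would come from the CSO team.
DISALLOWED = [re.compile(p) for p in (r"<script", r"\x00", r"DROP\s+TABLE")]

def scan_handler(event, context):
    """Lambda-style handler: scan a staged S3 object for disallowed characters or text."""
    bucket = event["bucket"]
    key = event["key"]
    # Note: loads the whole object into memory; large files should be read in chunks.
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8", errors="replace")
    findings = [p.pattern for p in DISALLOWED if p.search(body)]
    # Returning the verdict lets the orchestrator route the file to quarantine or onward copy.
    return {"bucket": bucket, "key": key, "clean": not findings, "findings": findings}
```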

 

Custom Data ingestion pattern using basic AWS services

While the two sections above focus on out-of-the-box AWS solutions and on individual constraints and concerns, this section describes a customized data ingestion pattern as a viable alternative. The sequence of high-level activities is as follows:

  1. Any external data should first be sanitized before it is copied to the client environment. Therefore, external data should first land in an isolated staging area where it is checked for security and compliance.
  2. Mandatory security checks include, but are not limited to, character scanning and antivirus checks.
  3. Any data pushed to the primary VPC must have passed all mandatory security checks; otherwise it needs to be quarantined. Green data is then further checked for completeness and accuracy, and if those checks fail, the failure is reported back to the source. Bad data should be moved to failed buckets.
  4. State information across the pattern is maintained using manifest files (an illustrative manifest appears after this list). External tables are created on top of these manifest files and refreshed periodically to enable logging and monitoring. This state information is also used to restart the application from any point of failure.
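A manifest might be structured as in the sketch below; the schema, bucket, and prefix are illustrative assumptions only, chosen so that partition-level and file-level status can be queried through external tables and used to resume after a failure.

```python
import json
import boto3

s3 = boto3.client("s3")

# Illustrative manifest schema only: one record per partition, with
# per-file copy/check status that downstream external tables can query.
manifest = {
    "partition": "business_date=2024-03-31",
    "status": "COPIED",  # e.g. LANDED -> SCANNED -> COPIED -> VALIDATED
    "files": [
        {"key": "export/part-0001.parquet", "size": 104857600,
         "checks": {"scan": "PASS", "completeness": "PASS"}},
        {"key": "export/part-0002.parquet", "size": 98566144,
         "checks": {"scan": "PASS", "completeness": "PENDING"}},
    ],
}

# Bucket and prefix are hypothetical; a restart reads the latest manifest
# to resume from the last successful step.
s3.put_object(
    Bucket="ingestion-state-bucket",
    Key="manifests/business_date=2024-03-31/manifest.json",
    Body=json.dumps(manifest).encode("utf-8"),
)
```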

The following diagram depicts this approach.

Solution Overview

A solution can be built with the above approach using basic AWS services such as Lambda and Glue, orchestrated via Step Functions. The following steps can be followed:

  • Establishing Source VPC / Non Routable Ingress VPC: This involves setting up a Virtual Private Cloud (VPC) in AWS to create a secure network environment. Relevant AWS services include Amazon Virtual Private Cloud (Amazon VPC).
  • Conducting Comprehensive Scans: Utilize AWS services like AWS Lambda for executing scanning functions, and Amazon Simple Storage Service (Amazon S3) for storing and accessing the data to be scanned.
  • Enforcing Strict Access Controls: Use AWS Identity and Access Management (IAM) to define and enforce access policies, ensuring only authorized entities can access the data.
  • Streamlining Data Management with Dedicated Buckets: Set up dedicated Amazon S3 buckets within the VPC to organize and manage data efficiently.
  • Enhancing Security Measures with Regular Expressions: Implement regular expression matching for data scanning using AWS Lambda functions or AWS Glue for data preprocessing and analysis.
  • Utilizing Lambda Functions for Task Automation: AWS Lambda functions are used for automating tasks such as data copying, querying, and access control enforcement.
  • Keeping Stakeholders Informed with Notifications: Utilize Amazon Simple Notification Service (Amazon SNS) for sending notifications to stakeholders about incoming data partitions and associated actions.
  • Optimizing Data Processing Efficiency with Step Functions: Implement AWS Step Functions to orchestrate data processing tasks such as data copying and quality control checks, ensuring efficient and reliable execution (a minimal orchestration sketch follows this list).
  • Maintaining Transparent Data Tracking with Manifest Files: Use Amazon S3 to store manifest files at partition and file levels, providing visibility into data copy status and facilitating transparent data tracking.
  • Implementing Fail-Safe Mechanisms: Implement error handling and retry mechanisms using AWS Step Functions or AWS Lambda to mitigate the risk of system failures.
  • Strengthening Data Integrity Verification: Use AWS services like Amazon S3 and AWS Glue for data validation, and AWS Key Management Service (KMS) for encryption to ensure data integrity during transfer and storage.
  • Ensuring Data Accuracy and Completeness through Reconciliation Checks: Conduct reconciliation checks using AWS Glue or AWS Step Functions, and maintain data manifest records in Amazon S3 to track data accuracy and completeness.
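A minimal Step Functions orchestration for the scan, copy, and quality-check stages might look like the sketch below, expressed as an Amazon States Language definition created via boto3. The Lambda ARNs, role ARN, and state machine name are placeholders; a real flow would add quarantine branches and a Map state for per-file fan-out.

```python
import json
import boto3

sfn = boto3.client("stepfunctions")

# Minimal Amazon States Language definition: scan, then copy, then validate.
# The Lambda ARNs and the execution role below are placeholders.
definition = {
    "StartAt": "SecurityScan",
    "States": {
        "SecurityScan": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111111111111:function:security-scan",
            "Next": "CopyToPrimaryVpc",
        },
        "CopyToPrimaryVpc": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111111111111:function:copy-partition",
            "Retry": [{"ErrorEquals": ["States.TaskFailed"], "IntervalSeconds": 30, "MaxAttempts": 3}],
            "Next": "QualityChecks",
        },
        "QualityChecks": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-east-1:111111111111:function:quality-checks",
            "End": True,
        },
    },
}

sfn.create_state_machine(
    name="external-data-ingestion",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::111111111111:role/stepfunctions-ingestion-role",
)
```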

Addressing non-functional requirements

To execute data transfer between AWS accounts belonging to different companies or organizations effectively, the non-functional requirements (NFRs) below are critical. The following list shows how each NFR is addressed, the relevant AWS services, and why it matters.

  • Access Control, Data Ownership and Responsibility. Solution: Assign IAM roles to EC2 instances and document data ownership in a data sharing agreement. Relevant AWS services: IAM, AWS KMS. Importance: Crucial for data confidentiality and preventing unauthorized access; vital for defining accountability and ensuring a smooth transfer process.
  • Secure connectivity between organizations. Solution: Configure VPC peering between accounts. Relevant AWS services: Amazon VPC, AWS Direct Connect. Importance: Essential for establishing reliable communication channels.
  • Cost Optimization. Solution: Set up AWS Budgets to monitor data transfer costs. Relevant AWS services: AWS Cost Explorer, AWS Budgets. Importance: Essential for managing expenses and optimizing resource utilization.
  • Monitoring and Logging (Observability). Solution: Enable AWS CloudTrail to log API calls. Relevant AWS services: AWS CloudTrail, Amazon CloudWatch. Importance: Crucial for ensuring data integrity, auditing, and troubleshooting.
  • Data Compliance and Governance. Solution: Implement data retention policies compliant with regulations. Relevant AWS services: AWS Artifact, AWS Config. Importance: Critical for adhering to regulations and maintaining data integrity.
  • Scalability and Performance Optimization. Solution: Load test the data transfer processes. Relevant AWS services: AWS Lambda and Step Functions for orchestration. Importance: Ensures scalability and efficiency of the data transfer process.
  • Data Lifecycle Management. Solution: Implement lifecycle policies to transition data to cheaper storage classes. Relevant AWS services: Amazon S3 Lifecycle Policies, AWS Glue. Importance: Ensures efficient data storage, retrieval, and compliance with policies.
  • Backup and Disaster Recovery. Solution: Configure cross-region replication for critical data. Relevant AWS services: AWS Backup, Amazon S3 Cross-Region Replication within the target organization. Importance: Essential for ensuring data availability and resilience against failures.
  • Data Encryption and Security. Solution: Encrypt data at rest using AWS KMS-managed keys (a brief example follows this list). Relevant AWS services: AWS Key Management Service (KMS), Amazon S3 server-side encryption. Importance: Critical for protecting sensitive data from unauthorized access.
  • Availability. Solution: Ensure failover by setting up multi-AZ deployments. Relevant AWS services: Amazon RDS Multi-AZ, Amazon Route 53 health checks. Importance: Essential for maintaining data integrity and minimizing downtime.
  • Recovery. Solution: Maintain state information. Relevant AWS services: manifest files stored in S3 buckets. Importance: Essential for restarting from the point of failure in case of an SRE event.
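As a small illustration of the encryption item above, an object can be copied into the target bucket under a customer-managed KMS key; the bucket names and key ARN below are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Copy an object into the target bucket with server-side encryption under a
# customer-managed KMS key; bucket names and the key ARN are placeholders.
s3.copy_object(
    CopySource={"Bucket": "staging-ingress-bucket", "Key": "ingest/part-0001.parquet"},
    Bucket="primary-vpc-data-bucket",
    Key="ingest/part-0001.parquet",
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="arn:aws:kms:us-east-1:111111111111:key/00000000-0000-0000-0000-000000000000",
)
```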

Salient features

  • This pattern does not rely on specialized AWS data transfer services (which might not be whitelisted in the organization); it relies solely on the basic building blocks of AWS.
  • Compliance is the topmost priority for this pattern.
  • Optimum performance due to parallelization.
  • High availability.
  • Cost efficient.
  • Fine-grained observability and traceability.
  • Allows chaining of data filters and compliance-check mechanisms (e.g., special-character and antivirus scans), with data also checked for completeness and accuracy. The pattern can be extended to include additional data quality checks, chaining logic, and functional requirement checks.

Conclusion

This article provides an alternative to existing specialized AWS data movement services when they cannot be used because of organizational and technical constraints. The alternative is a low-cost, high-performing solution capable of securely transferring data between AWS accounts owned by different enterprises. The solution is configurable and customizable per enterprise standards and policies.
