This blog was co-authored by DeNA Co., Ltd. and Amazon Web Services Japan.
DeNA Co., Ltd. (DeNA) engages in a variety of businesses, from games and live communities to sports & the community and healthcare & medical, under our mission to delight people beyond their wildest dreams. Among these, the healthcare & medical business handles particularly sensitive data. To comply with their data policies for sensitive data, this healthcare & medical business set the following requirements for their data processing:
- Process data in compliance with data policies – Mask or delete sensitive data as necessary to transform into anonymized data. Prevent the inclusion of invalid values in categorical data and process data without any data loss.
- Conduct data quality tests on anonymized data in compliance with data policies – Conduct data quality tests to quickly identify and address data quality issues, maintaining high-quality data at all times.
This post introduces a case study where DeNA combined Amazon Redshift Serverless and dbt (dbt Core) to accelerate data quality tests in their business.
The challenge
Data quality tests require performing 1,300 tests on 10 TB of data monthly. Previously, DeNA ran Python-based batch jobs on Amazon Elastic Compute Cloud (Amazon EC2) to perform these data quality tests. As business and data volume grew over time, DeNA started to face the following challenges:
- Performance – Data quality tests took days to weeks to complete because engineers hadn’t designed the batch jobs to handle big data.
- Cost – Costs increased due to the batch job design, particularly for large datasets. The implementation required loading data into memory for processing. When handling large table data, DeNA needed to use large memory-optimized EC2 instances.
- Maintainability – The batch job implementations varied significantly between engineers, leading to high maintenance overhead, because the required knowledge was siloed among individual engineers.
The switch to Redshift Serverless and dbt
To address these challenges, DeNA decided to adopt Redshift Serverless and dbt (an open source data transformation tool) for the following key reasons:
- Scalable and cost-effective processing with Redshift Serverless
- Standardized and maintainable data quality tests with dbt
This decision was made after careful comparison of alternative solutions. DeNA initially considered parallelizing the existing Python-based batch jobs but rejected this approach due to the high maintenance overhead and siloed knowledge associated with the batch jobs. Instead, DeNA decided to use dbt, which DeNA has been using in their healthcare & medical business, and connect it to an AWS service capable of large-scale distributed processing. dbt provides a SQL-first templating engine for repeatable and extensible data transformations, including a data tests feature, which allows verifying data models and tables against expected rules and conditions using SQL. By using dbt, DeNA could standardize the technical stack, implement data quality tests in maintainable SQL, and connect dbt to a managed service for scalable and cost-effective processing.
AWS offers several services that are compatible with dbt, including Amazon Redshift and AWS Glue. DeNA selected Redshift Serverless, primarily due to its serverless nature, optimal cost-performance, and the superior processing performance for structured data typical of a data warehouse service.
Solution overview
DeNA designed the following architecture using AWS serverless services.
The workflow consists of the following high-level steps and key design points:
- The source system stores the target data for the data quality tests in Amazon Simple Storage Service (Amazon S3). When new data files are added, Amazon EventBridge invokes an AWS Step Functions state machine (workflow). To make sure all files for target data are delivered, the source system stores a completion file in Amazon S3.
- dbt runs on Amazon Elastic Container Service (Amazon ECS) using AWS Fargate, an AWS serverless container service. DeNA selected Amazon ECS because it allows running dbt in a serverless, pay-per-use manner, and DeNA had prior experience developing and operating applications using Amazon ECS. To allow the containers to securely access Redshift Serverless, DeNA used the pass sensitive data to an ECS container feature to pass sensitive credentials that are stored in AWS Secrets Manager to the containers using an ECS task execution IAM role.
- DeNA segmented Redshift Serverless into separate workgroups for access control. Operation personnel may need to access the Redshift Serverless database using the Query Editor V2 to investigate issues with data quality tests, while maintaining strict access control. Redshift Serverless allows fine-grained access control to data by using database security features, similar to how the GRANT command is used in database products. However, in this workload, DeNA chose to use AWS Identity and Access Management (IAM) to control access to the workgroups at IAM level. This allowed DeNA to restrict access to specific Redshift Serverless workgroups based on users’ IAM roles, enabling unified management of authorization through IAM. Additionally, by separating the workgroups, DeNA could individually adjust Redshift Processing Units (RPUs) per workgroup, contributing to cost optimization.
- Amazon ECS sends execution logs of dbt running to Amazon CloudWatch Logs for observability. DeNA used metric filters to convert the logs into CloudWatch metrics, then created alarms based on these metrics. When triggered, these alarms invoke AWS Lambda functions using Amazon Simple Notification Service (Amazon SNS). The Lambda functions create result reports of dbt running and data quality tests and send them to an internal chat application. DeNA visualizes the results of data quality tests using the elementary CLI, a dbt-based data observability solution. This workflow enables even non-engineers to track data quality status effectively.
Outcomes
DeNA successfully addressed all the challenges they faced by designing the solution and migrating to a new platform:
- Performance – Improved performance up to 100 times faster by reducing processing time from days or weeks to 1–2 hours. A certain data quality test that previously took 877 minutes now completes in 1 minute, thanks to the large-scale distributed processing capabilities of Redshift Serverless.
- Cost – Reduced costs by 90% with AWS serverless services. Optimized expenses by incurring costs only for data quality tests.
- Maintainability – Standardized the technical stack with dbt, eliminating siloed knowledge from custom programs. dbt’s data tests feature simplified the implementation of data quality tests. The elementary CLI improved the observability of data quality tests for non-engineers. AWS serverless services virtually eliminated the operational overhead for managing the workload infrastructure.
Conclusion
This post demonstrated how DeNA was able to securely and efficiently accelerate their data quality tests by combining Redshift Serverless and dbt. This combination is not only effective for DeNA’s use case but also applicable to various business use cases across different industries.
For more information on the combination of Redshift Serverless and dbt, refer to the following resources:
About the Author
Momota Sasaki is an Engineering Manager at DeSC Healthcare, a subsidiary of DeNA. He joined DeNA in 2021 and was seconded to DeSC Healthcare. Since then, he has been consistently involved in the healthcare business, leading and promoting the development and operation of the data platform.
Kaito Tawara is a Data Engineer at DeSC Healthcare, a subsidiary of DeNA, focusing on improving healthcare data platforms. After gaining experience in backend development for web systems and data science, he transitioned to data engineering. He joined DeNA in 2023 and was seconded to DeSC Healthcare. Currently, he works remotely from Nagoya-city, contributing to the enhancement of healthcare data platforms.
Shota Sato is an Analytics Specialist Solution Architect at AWS Japan, focusing on data analytics solutions powered by AWS for digital native business customers.