Understanding AWS Data Pipeline: Core Concepts and Functionality
AWS Data Pipeline is a web service that automates moving and transforming data. With AWS Data Pipeline, you define data-driven workflows in which tasks depend on the successful completion of preceding tasks. You specify the parameters of your data transformations, and AWS Data Pipeline enforces the logic you have set up.
Data Pipeline Concepts
Pipeline Definition
A pipeline definition is how you communicate your business logic to AWS Data Pipeline. AWS Data Pipeline handles identifying tasks, scheduling them, and assigning them to task runners. If a task does not complete successfully, AWS Data Pipeline retries it according to your instructions and, if necessary, reassigns it to another task runner. If a task fails repeatedly, you can configure the pipeline to notify you.
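As a rough sketch of what such a definition looks like, the JSON fragment below declares a monthly schedule, an activity with a retry limit, and an SNS notification that fires after the retries are exhausted. The object IDs and the topic ARN are illustrative placeholders, not values from this article:

```json
{
  "objects": [
    {
      "id": "MonthlySchedule",
      "type": "Schedule",
      "period": "1 month",
      "startDateTime": "2013-01-01T00:00:00",
      "endDateTime": "2014-01-01T00:00:00"
    },
    {
      "id": "ArchiveLogs",
      "type": "CopyActivity",
      "schedule": { "ref": "MonthlySchedule" },
      "maximumRetries": "3",
      "onFail": { "ref": "FailureAlert" }
    },
    {
      "id": "FailureAlert",
      "type": "SnsAlarm",
      "topicArn": "arn:aws:sns:us-east-1:111122223333:pipeline-alerts",
      "subject": "Pipeline task failed",
      "message": "A task failed after all retries."
    }
  ]
}
```

The `onFail` reference is what wires the "notify me on repeated failure" behavior into the definition itself, rather than into application code.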
To illustrate, within your pipeline definition, you may specify that log files generated by your application should be archived monthly throughout the year 2013 to an Amazon S3 bucket. AWS Data Pipeline would then generate 12 tasks, each responsible for transferring a month’s worth of data, regardless of the varying number of days in each month.
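The "12 tasks regardless of month length" point can be made concrete with a short, self-contained Python sketch (not part of the AWS API) that computes the monthly windows a scheduler would produce for 2013:

```python
from datetime import date
import calendar

def monthly_windows(year):
    """Return one (start, end) date pair per month of the given year."""
    windows = []
    for month in range(1, 13):
        # monthrange returns (weekday of first day, number of days in month)
        last_day = calendar.monthrange(year, month)[1]
        windows.append((date(year, month, 1), date(year, month, last_day)))
    return windows

windows = monthly_windows(2013)
print(len(windows))   # 12 tasks, one per month
print(windows[1])     # February's window covers only 28 days in 2013
```

Each pair corresponds to one task's worth of data; the varying month lengths change the window size, not the number of tasks.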
Pipeline Components
Pipeline components represent the business logic of the pipeline and correspond to the distinct sections of a pipeline definition. They specify the data sources, activities, schedule, and preconditions of the workflow. Components can inherit properties from a parent component, and relationships between components are expressed through references. Together, the pipeline components define the rules of your data management.
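Inheritance and references can be seen in a definition fragment like the following sketch, where a `Default` object supplies shared properties, the activity inherits them via `parent`, and `ref` fields link the activity to other components. The component IDs here are illustrative, not from this article:

```json
{
  "objects": [
    {
      "id": "Default",
      "scheduleType": "cron",
      "failureAndRerunMode": "CASCADE"
    },
    {
      "id": "CopyLogsActivity",
      "type": "CopyActivity",
      "parent": { "ref": "Default" },
      "input": { "ref": "InputLogs" },
      "output": { "ref": "ArchiveBucket" },
      "schedule": { "ref": "MonthlySchedule" }
    }
  ]
}
```

`CopyLogsActivity` picks up `scheduleType` and `failureAndRerunMode` from `Default` without restating them, which keeps large definitions from repeating common settings in every component.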
Instances
When AWS Data Pipeline runs a pipeline, it compiles the pipeline components to create a set of executable instances. Each instance contains all the information needed to perform a specific task. The complete set of instances is the to-do list of the pipeline. AWS Data Pipeline hands the instances out to task runners to process.
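The compile step can be pictured with a small conceptual Python sketch (again, not the AWS API): one component plus a schedule is expanded into concrete instances, each carrying the details a task runner needs. The field names and the three-period schedule are assumptions for illustration:

```python
from datetime import datetime, timedelta

# A single pipeline component and a toy schedule of three periods.
component = {"id": "ArchiveLogs", "type": "CopyActivity"}
start = datetime(2013, 1, 1)
periods = 3

# Expand the component into one executable instance per scheduled period.
instances = [
    {
        "componentId": component["id"],
        "instanceId": f"{component['id']}_{n}",
        "scheduledStart": start + timedelta(days=30 * n),
        "status": "WAITING_ON_DEPENDENCIES",
    }
    for n in range(periods)
]

print(len(instances))                  # one instance per scheduled period
print(instances[0]["instanceId"])
```

The full list of such instances is the pipeline's to-do list; a task runner receives one instance at a time and works through it.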
Attempts
To provide robust data management, AWS Data Pipeline retries a failed operation. This process…