Reusable Python Code Modules, Part 12 - Data Transformation and Formatting
Optimizing Data Flow: Building Reusable Data Transformation and Formatting Modules
Data transformation and formatting are critical processes in backend development, ensuring data consistency, compatibility, and usability across different systems. This guide covers how to structure reusable data transformation and formatting modules in Python (for example, in a Flask backend), manage data efficiently, and integrate common libraries.
Common Libraries and Tools
1. Pandas
Pandas is a powerful data manipulation and analysis library for Python.
Key Features
DataFrame: Provides the DataFrame, a labeled tabular data structure for efficient in-memory data manipulation
Flexible Data Manipulation: Supports filtering, grouping, joining, reshaping, and aggregation operations
Integration with Other Libraries: Easily integrates with NumPy, Matplotlib, and other data analysis libraries
Rich Functionality: Includes extensive functionality for data cleaning, transformation, and formatting (see the short sketch after this list)
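As a small illustration of that cleaning and formatting functionality, here is a minimal sketch (the clean_and_format helper and its field names are hypothetical, chosen to match the product records used later in this guide):

import pandas as pd

def clean_and_format(records):
    # Hypothetical helper: normalize a list of raw records before further processing
    df = pd.DataFrame(records)
    df = df.dropna(subset=['price'])                   # drop rows with no price
    df['quantity'] = df['quantity'].fillna(0).astype(int)
    df['product'] = df['product'].str.strip().str.upper()
    df['price'] = df['price'].round(2)
    return df.to_dict(orient='records')                # plain dicts, ready to serialize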
2. Marshmallow
Marshmallow is an ORM/ODM/framework-agnostic library for object serialization and deserialization.
Key Features
Schema-Based Validation: Validates and transforms input data based on defined schemas
Serialization and Deserialization: Converts complex data types to and from native Python data types
Customizable: Allows for custom serialization and deserialization methods
Integration with Flask: Works seamlessly with Flask for data validation and transformation
3. PyYAML
PyYAML is a YAML parser and emitter for Python.
Key Features
YAML Parsing: Parses YAML files and converts them to Python objects
YAML Emission: Converts Python objects to YAML format
Flexible Configuration: Supports custom tags and types for parsing and emitting YAML
Integration: Easily integrates with other Python libraries for data manipulation
4. Pydantic
Pydantic is a data validation and settings management library using Python type annotations.
Key Features
Type Annotations: Uses Python type annotations for data validation and transformation
Automatic Parsing: Automatically parses and validates input data
Custom Validators: Supports custom validation methods
Integration with FastAPI: Works seamlessly with FastAPI for data validation and serialization
Comparison
Pandas: Best for extensive data manipulation and analysis tasks with flexible data handling.
Marshmallow: Ideal for schema-based data validation and transformation with serialization and deserialization support.
PyYAML: Suitable for applications that need to parse and emit YAML, most commonly configuration files.
Pydantic: Best for type-based data validation and transformation with support for custom validators.
Examples
Example 1: Pandas
Setup:
$ pip install pandas
Configuration:
import pandas as pd

def transform_data(data):
    df = pd.DataFrame(data)
    # Example transformation: Add a new column
    df['total'] = df['quantity'] * df['price']
    return df

data = [
    {'product': 'A', 'quantity': 10, 'price': 1.5},
    {'product': 'B', 'quantity': 5, 'price': 2.0},
]
Usage:
transformed_data = transform_data(data)
print(transformed_data)
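In a backend service you typically need the transformed frame back as plain Python structures before returning it from an endpoint. A minimal follow-up step, assuming the transform_data function above:

# Convert the DataFrame into a list of plain dicts (JSON-friendly)
records = transformed_data.to_dict(orient='records')
print(records)
# [{'product': 'A', 'quantity': 10, 'price': 1.5, 'total': 15.0},
#  {'product': 'B', 'quantity': 5, 'price': 2.0, 'total': 10.0}]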
Example 2: Marshmallow
Setup:
$ pip install marshmallow
Configuration:
from marshmallow import Schema, fields, post_load

class ProductSchema(Schema):
    product = fields.Str(required=True)
    quantity = fields.Int(required=True)
    price = fields.Float(required=True)
    total = fields.Float()

    @post_load
    def calculate_total(self, data, **kwargs):
        # Compute the derived 'total' field after deserialization
        data['total'] = data['quantity'] * data['price']
        return data

product_schema = ProductSchema(many=True)

data = [
    {'product': 'A', 'quantity': 10, 'price': 1.5},
    {'product': 'B', 'quantity': 5, 'price': 2.0},
]
Usage:
transformed_data = product_schema.load(data)
print(transformed_data)
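Because Marshmallow integrates well with Flask, the same schema can validate incoming request data. The following is a minimal sketch of a hypothetical endpoint (route name and app wiring are illustrative) that reuses the product_schema defined above:

from flask import Flask, jsonify, request
from marshmallow import ValidationError

app = Flask(__name__)

@app.route('/products', methods=['POST'])
def create_products():
    try:
        # Validate and transform the incoming JSON payload
        products = product_schema.load(request.get_json())
    except ValidationError as err:
        return jsonify(err.messages), 400
    return jsonify(products), 201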
Example 3: PyYAML
Setup:
$ pip install pyyaml
Configuration:
import yaml

def parse_yaml(yaml_str):
    # Parse YAML text into native Python objects
    return yaml.safe_load(yaml_str)

def emit_yaml(data):
    # Serialize Python objects back to YAML text
    return yaml.safe_dump(data)

yaml_str = """
products:
  - product: A
    quantity: 10
    price: 1.5
  - product: B
    quantity: 5
    price: 2.0
"""
Usage:
parsed_data = parse_yaml(yaml_str)
print(parsed_data)
yaml_output = emit_yaml(parsed_data)
print(yaml_output)
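Because YAML is widely used for configuration, the same helpers extend naturally to files on disk. A minimal sketch, where the config.yaml path is illustrative:

def load_config(path='config.yaml'):
    # Read a YAML configuration file into native Python objects
    with open(path) as f:
        return yaml.safe_load(f)

def save_config(config, path='config.yaml'):
    # Write Python objects back out as YAML
    with open(path, 'w') as f:
        yaml.safe_dump(config, f)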
Example 4: Pydantic
Setup:
$ pip install pydantic
Configuration:
from pydantic import BaseModel, Field, validator
from typing import List

class Product(BaseModel):
    product: str
    quantity: int
    price: float
    total: float = Field(default=0)

    # Pydantic v1-style validator; in Pydantic v2, prefer @field_validator or @model_validator
    @validator('total', always=True)
    def calculate_total(cls, v, values):
        return values['quantity'] * values['price']

class ProductList(BaseModel):
    products: List[Product]

data = {
    "products": [
        {"product": "A", "quantity": 10, "price": 1.5},
        {"product": "B", "quantity": 5, "price": 2.0}
    ]
}
Usage:
product_list = ProductList(**data)
print(product_list.json())
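To illustrate the custom-validator support mentioned above, a field-level validator can reject invalid values before any totals are computed. A minimal sketch using the same Pydantic v1-style validator API as the example (the model name and error message are illustrative):

from pydantic import BaseModel, validator

class ValidatedProduct(BaseModel):
    product: str
    quantity: int
    price: float

    @validator('quantity', 'price')
    def must_be_non_negative(cls, value):
        # Reject negative quantities and prices with a clear validation error
        if value < 0:
            raise ValueError('must be non-negative')
        return value

Constructing ValidatedProduct(product='A', quantity=-1, price=1.5) now raises a ValidationError instead of silently producing a nonsensical total.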