Spark AutoMapper¶
Fluent API to map data from one view to another in Apache Spark.
Uses native Spark functions underneath, so it is just as fast as hand-writing the transformations.
Since this is just Python, you can use any Python editor. Everything is annotated with Python type hints, so most editors will auto-complete and warn you when you do something wrong.
Typical Usage¶
First create an AutoMapper (AutoMapper). The two required parameters are:

- source_view: the Spark view to read the source data from
- view: the destination Spark view to store the data. This view should not already exist.

Next, create the mappings. There are two kinds of mappings: mapping to a complex (strongly typed) structure and mapping to plain columns; both are shown in the example below.

In each mapping, access a column in the source view with A.column("column_name") (column()) or specify a literal value with A.text("apple") (text()). Chain transformation functions onto A.column("column_name") to transform the column (AutoMapperDataTypeBase).
Example¶
AutoMapper with Strongly typed classes (Recommended)
mapper = AutoMapper(
    view="members",
    source_view="patients"
).complex(
    MyClass(
        name=A.column("last_name"),
        age=A.column("my_age").to_number()
    )
)
The strongly typed class can be created as below; the constructor accepts whatever parameters you want, and the class must implement the get_schema() method:
from typing import Optional, Union

from pyspark.sql.types import DataType, LongType, StringType, StructField, StructType

from spark_auto_mapper.data_types.complex.complex_base import (
    AutoMapperDataTypeComplexBase,
)
from spark_auto_mapper.data_types.number import AutoMapperNumberDataType
from spark_auto_mapper.data_types.text_like_base import AutoMapperTextLikeBase


class MyClass(AutoMapperDataTypeComplexBase):
    def __init__(
        self, name: AutoMapperTextLikeBase, age: AutoMapperNumberDataType
    ) -> None:
        super().__init__(name=name, age=age)

    def get_schema(
        self, include_extension: bool
    ) -> Optional[Union[StructType, DataType]]:
        schema: StructType = StructType(
            [
                StructField("name", StringType(), False),
                StructField("age", LongType(), True),
            ]
        )
        return schema
AutoMapper with just columns
mapper = AutoMapper(
    view="members",
    source_view="patients",
).columns(
    dst1=A.column("src1"),
    dst2=AutoMapperList(["address1"]),
    dst3=AutoMapperList(["address1", "address2"]),
    dst4=AutoMapperList([A.complex(use="usual", family=A.column("last_name"))]),
)
Running an AutoMapper¶
An AutoMapper is a Spark Transformer, so you can call its transform function and pass in a DataFrame.
result_df: DataFrame = mapper.transform(df=df)
NOTE: AutoMapper ignores the DataFrame passed in, since it reads from and writes to views instead; this allows more flexibility.
A view can be created in Spark by calling the createOrReplaceTempView function:

df.createOrReplaceTempView("patients")

For example, you can create the "patients" view used above directly from an in-memory DataFrame:
spark_session.createDataFrame(
    [
        (1, "Qureshi", "Imran", 45),
        (2, "Vidal", "Michael", 35),
    ],
    ["member_id", "last_name", "first_name", "my_age"],
).createOrReplaceTempView("patients")
Contents:¶
- API Reference
spark_auto_mapper
spark_auto_mapper.automappers
spark_auto_mapper.automappers.automapper
spark_auto_mapper.automappers.automapper_analysis_exception
spark_auto_mapper.automappers.automapper_base
spark_auto_mapper.automappers.automapper_exception
spark_auto_mapper.automappers.check_schema_result
spark_auto_mapper.automappers.column_spec_wrapper
spark_auto_mapper.automappers.columns
spark_auto_mapper.automappers.complex
spark_auto_mapper.automappers.container
spark_auto_mapper.automappers.with_column
spark_auto_mapper.automappers.with_column_base
spark_auto_mapper.data_types
spark_auto_mapper.data_types.complex
spark_auto_mapper.data_types.amount
spark_auto_mapper.data_types.array
spark_auto_mapper.data_types.array_base
spark_auto_mapper.data_types.array_distinct
spark_auto_mapper.data_types.array_max
spark_auto_mapper.data_types.boolean
spark_auto_mapper.data_types.coalesce
spark_auto_mapper.data_types.column
spark_auto_mapper.data_types.column_wrapper
spark_auto_mapper.data_types.concat
spark_auto_mapper.data_types.data_type_base
spark_auto_mapper.data_types.date
spark_auto_mapper.data_types.date_format
spark_auto_mapper.data_types.datetime
spark_auto_mapper.data_types.decimal
spark_auto_mapper.data_types.expression
spark_auto_mapper.data_types.field
spark_auto_mapper.data_types.filter
spark_auto_mapper.data_types.first
spark_auto_mapper.data_types.first_valid_column
spark_auto_mapper.data_types.flatten
spark_auto_mapper.data_types.float
spark_auto_mapper.data_types.hash
spark_auto_mapper.data_types.if_
spark_auto_mapper.data_types.if_column_exists
spark_auto_mapper.data_types.if_not
spark_auto_mapper.data_types.if_not_null
spark_auto_mapper.data_types.if_not_null_or_empty
spark_auto_mapper.data_types.if_regex
spark_auto_mapper.data_types.join_using_delimiter
spark_auto_mapper.data_types.list
spark_auto_mapper.data_types.literal
spark_auto_mapper.data_types.lpad
spark_auto_mapper.data_types.map
spark_auto_mapper.data_types.null_if_empty
spark_auto_mapper.data_types.number
spark_auto_mapper.data_types.regex_extract
spark_auto_mapper.data_types.regex_replace
spark_auto_mapper.data_types.split_by_delimiter
spark_auto_mapper.data_types.substring
spark_auto_mapper.data_types.substring_by_delimiter
spark_auto_mapper.data_types.text_like_base
spark_auto_mapper.data_types.transform
spark_auto_mapper.data_types.trim
spark_auto_mapper.data_types.unix_timestamp
spark_auto_mapper.helpers
spark_auto_mapper.type_definitions