spark_auto_mapper.automappers.automapper

Module Contents

Classes

AutoMapper

Main AutoMapper Class

class spark_auto_mapper.automappers.automapper.AutoMapper(keys=None, view=None, source_view=None, keep_duplicates=False, drop_key_columns=True, checkpoint_after_columns=None, checkpoint_path=None, reuse_existing_view=False, use_schema=True, include_extension=False, include_null_properties=False, use_single_select=True, verify_row_count=True, skip_schema_validation=['extension'], skip_if_columns_null_or_empty=None, keep_null_rows=False, filter_by=None, logger=None, check_schema_for_all_columns=False, copy_all_unmapped_properties=False, copy_all_unmapped_properties_exclude=None, log_level=None)

Bases: spark_auto_mapper.automappers.container.AutoMapperContainer

Main AutoMapper Class

Creates an AutoMapper

Parameters
  • keys (Optional[List[str]]) – joining keys

  • view (Optional[str]) – view to return

  • source_view (Optional[str]) – where to load the data from

  • keep_duplicates (bool) – whether to keep duplicate rows in the result (duplicates are dropped by default)

  • drop_key_columns (bool) – whether to drop the key columns at the end

  • checkpoint_after_columns (Optional[int]) – checkpoint the data frame after this many columns have been processed

  • checkpoint_path (Optional[Union[str, pathlib.Path]]) – path where checkpoints are stored

  • reuse_existing_view (bool) – If view already exists, whether to reuse it or create a new one

  • use_schema (bool) – apply schema to columns

  • include_extension (bool) – whether to include extension elements. These are excluded by default because they add a lot of schema; set this to True if you use extensions

  • include_null_properties (bool) – whether to include properties whose values are null

  • use_single_select (bool) – run the AutoMapper as a single select of all the columns at once. This is faster, but harder to debug since a failure is not attributed to a specific column

  • verify_row_count (bool) – verify that the row count is the same before and after the transformation

  • skip_schema_validation (List[str]) – skip schema checks on these columns

  • skip_if_columns_null_or_empty (Optional[List[str]]) – skip creating the record if any of these columns are null or empty

  • keep_null_rows (bool) – whether to keep the null rows instead of removing them

  • filter_by (Optional[str]) – SQL expression used to filter the source data frame

  • copy_all_unmapped_properties (bool) – copy any property that is not explicitly mapped

  • copy_all_unmapped_properties_exclude (Optional[List[str]]) – exclude these columns when copy_all_unmapped_properties is set

  • logger (Optional[logging.Logger]) – logger used to log informational messages

  • check_schema_for_all_columns (bool) – whether to run the schema check on every mapped column

  • log_level (Optional[Union[int, str]]) – logging level to use (e.g., logging.INFO or "INFO")
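Example

A minimal construction sketch; the view names and the filter expression below are hypothetical:

from spark_auto_mapper.automappers.automapper import AutoMapper

mapper = AutoMapper(
    view="members",             # view to return
    source_view="patients",     # where to load the data from
    keys=["member_id"],         # joining keys
    drop_key_columns=False,     # keep member_id in the result
    filter_by="active = true",  # SQL expression used to filter the source rows
)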

transform_with_data_frame(self, df, source_df, keys)

Internal function called by the base class to transform the data frame

Parameters
  • df (pyspark.sql.DataFrame) – destination data frame

  • source_df (Optional[pyspark.sql.DataFrame]) – source data frame

  • keys (List[str]) – key columns

Returns

data frame after the transform

Return type

pyspark.sql.DataFrame

transform(self, df)

Uses this AutoMapper to transform the specified data frame and return the new data frame

Parameters

df (pyspark.sql.DataFrame) – source data frame

Returns

destination data frame

Return type

pyspark.sql.DataFrame
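For example, a minimal end-to-end sketch. The view, column, and sample data are hypothetical, the local SparkSession setup is only for illustration, and the import path for the A helpers follows the library's published examples:

from pyspark.sql import SparkSession, DataFrame
from spark_auto_mapper.automappers.automapper import AutoMapper
from spark_auto_mapper.helpers.automapper_helpers import AutoMapperHelpers as A

spark = SparkSession.builder.master("local[*]").getOrCreate()

# Register the source view that the mapper loads from
spark.createDataFrame(
    [(1, "Qureshi"), (2, "Vidal")], ["member_id", "last_name"]
).createOrReplaceTempView("patients")

mapper = AutoMapper(
    view="members", source_view="patients", keys=["member_id"]
).columns(dst1=A.column("last_name"))

# Run the mapping against the source data frame
result_df: DataFrame = mapper.transform(df=spark.table("patients"))
result_df.show()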

columns(self, **kwargs)

Adds mappings for columns

Example
mapper = AutoMapper(
    view="members",
    source_view="patients",
    keys=["member_id"],
    drop_key_columns=False,
).columns(
    dst1="src1",
    dst2=AutoMapperList(["address1"]),
    dst3=AutoMapperList(["address1", "address2"]),
    dst4=AutoMapperList([A.complex(use="usual", family=A.column("last_name"))]),
)

Parameters

kwargs (spark_auto_mapper.type_definitions.defined_types.AutoMapperAnyDataType) – keyword arguments where each key is a destination column name and each value is its mapping

Returns

The same AutoMapper

Return type

AutoMapper

complex(self, entity)

Adds mappings for an entity

Example
mapper = AutoMapper(
    view="members",
    source_view="patients",
    keys=["member_id"],
    drop_key_columns=False,
).complex(
    MyClass(
        name=A.column("last_name"),
        age=A.number(A.column("my_age")),
    )
)

Here MyClass stands for a user-defined complex type deriving from AutoMapperDataTypeComplexBase.

Parameters

entity (spark_auto_mapper.data_types.complex.complex_base.AutoMapperDataTypeComplexBase) – An AutoMapper type

Returns

The same AutoMapper

Return type

AutoMapper

__repr__(self)

Display for the debugger

Returns

string representation for debugger

Return type

str

to_debug_string(self, source_df=None)

Returns a string representation of the AutoMapper

Parameters

source_df (Optional[pyspark.sql.DataFrame]) – (Optional) source data frame

Returns

string representation

Return type

str
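For example (a short sketch, reusing the hypothetical mapper and SparkSession from the transform example above):

# Render the mapping as text without running it;
# the source data frame argument is optional
print(mapper.to_debug_string())
print(mapper.to_debug_string(source_df=spark.table("patients")))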

property column_specs(self)

Useful to show in debugger

Returns

dictionary of column specs

Return type

Dict[str, pyspark.sql.Column]
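For example, one might dump each generated Spark Column expression while debugging (a sketch, reusing the hypothetical mapper from the examples above):

# Print the Spark Column expression built for each destination column
for name, col in mapper.column_specs.items():
    print(f"{name}: {col}")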

__str__(self)

Return str(self).

Return type

str