### Objective
Enable pandas-profiling to use a Spark backend in order to profile Spark DataFrames.
### Constraints
* As much as possible, use spark.sql and native Spark DataFrame functions. Do not bring data to the driver: if the data could be collected locally, we could just use .toPandas() and apply the usual profile report (see the sketch after this list).
* We also need to assume that we can only perform read operations on the Spark DataFrame, and no modify/write operations, so as not to inadvertently crash the user's Spark server.
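For example, a minimal sketch of the first constraint (the session setup, toy dataframe, and column name `x` are placeholders for illustration):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
spark_df = spark.createDataFrame([(1.0,), (2.0,), (None,)], ["x"])  # toy stand-in

# Preferred: the aggregation runs on the cluster and only a scalar comes back.
mean_x = spark_df.agg(F.mean("x")).first()[0]

# Avoided: this collects the entire dataframe to the driver, at which point
# the ordinary pandas profile report would have sufficed anyway.
# mean_x = spark_df.toPandas()["x"].mean()
```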
### Broad implementation goal
We will replace, as far as possible, every function that operates on a pandas DataFrame with a Spark function operating on a Spark DataFrame, without interfering with the rest of the result flow. This lets us retain as much of the config builder, report builder, and visualisation code as possible.
To do this, we can identify the main calls in ProfileReport that actually operate on the dataframe itself, and modify only these functions. They are:
1. describe_df(self.title, self.df, self._sample)
2. hash_dataframe(df)
3. preprocess(df)
For describe_df(self.title, self.df, self.\_sample): we will need to replace all pandas functions within it with Spark functions. The specific proposal is outlined below.
For hash_dataframe(df): we will need to figure out a way to generate a hash of a Spark DataFrame (still unsure about this; a candidate sketch follows below).
For preprocess(df): this method modifies the original dataframe. Since we probably cannot do this efficiently on Spark (modify and write back into the Spark backend), I'd propose not rewriting this function for Spark.
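One candidate for the hash (a minimal sketch; `hash_spark_dataframe` is our placeholder name, not an existing API): fingerprint the schema, the row count, and an order-independent aggregate of per-row hashes. This stays read-only and distributed, at the cost of being an approximate identity (hash collisions and sum cancellation are possible).

```python
import hashlib

import pyspark.sql.functions as F
from pyspark.sql import DataFrame

def hash_spark_dataframe(df: DataFrame) -> str:
    """Approximate, read-only fingerprint of a Spark DataFrame."""
    # Hash each row with Spark's built-in murmur3 hash, then sum the hashes
    # so the result is independent of partitioning and row order.
    agg = (
        df.select(F.hash(*df.schema.names).alias("h"))
        .agg(F.sum("h").alias("s"), F.count("h").alias("n"))
        .first()
    )
    # Stable digest of column names and types (Python's built-in hash() is
    # salted per process, so it is unsuitable as a persistent cache key).
    schema_part = hashlib.md5(str(df.dtypes).encode()).hexdigest()[:12]
    return f"{schema_part}-{agg['n']}-{agg['s']}"
```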
### Proposed Phases of Function Development
Since there is quite a lot of work to do in describe_df and hash_dataframe, we can break it down into two phases: in the first phase, we develop an end-to-end pipeline that handles only Spark numeric types. Once that is done, we can start supporting more data types.
Phases:
1. Develop an end-to-end ProfileReport function that handles only numeric Spark columns (simplest)
2. Handle more complex Spark datatypes (date, boolean, URL)
### Implementation Analysis
In the following implementation section, we map out the actual operations and function calls that need to be replaced. There are 7 inner function calls within describe() that we will need to modify. These functions are the main profiling tools and comprise the bulk of the work. We have to look at each function and its nested function calls, and recursively apply the same analysis to every nested call.
How the section is indexed - the 7 main functions of describe():
* 1. get_series_descriptions(df, pbar)
* 2. calculate_correlation(df, variables, correlation_name)
* 3. get_scatter_matrix(df, variables)
* 4. get_table_stats(df, variable_stats)
* 5. get_missing_diagrams(df, table_stats)
* 6. get_sample(df)
* 7. get_duplicates(df, supported_columns)
These are indexed 1-7. Nested function calls are given further indexing such as 1.1, 2.3.1, etc. The root describe function is given the index 0.
### Implementation [WIP]
____
#### 0. def describe(title: str, df: pd.DataFrame, sample: Optional[dict] = None) -> dict:
"""Calculate the statistics for each series in this DataFrame.
from pandas_profiling.models.describe.py
Args:
title: report title
df: DataFrame.
sample: optional, dict with custom sample
Returns:
This function returns a dictionary containing:
- table: overall statistics.
- variables: descriptions per series.
- correlations: correlation matrices.
- missing: missing value diagrams.
- messages: direct special attention to these patterns in your data.
- package: package details.
"""
Signature Changes:
We will likely need to receive the Spark DataFrame as well as the SparkSession object, so that we can run spark.sql("...") queries.
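A minimal sketch of the adjusted signature (the extra `spark` parameter and the type change are our proposal, not existing pandas-profiling code):

```python
from typing import Optional

from pyspark.sql import DataFrame as SparkDataFrame
from pyspark.sql import SparkSession

def describe(
    title: str,
    df: SparkDataFrame,
    spark: SparkSession,
    sample: Optional[dict] = None,
) -> dict:
    """Spark-backed variant; `spark` lets us run spark.sql(...) queries."""
    ...
```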
Pandas to Spark Changes (old -> new):
* df.columns -> spark_df.schema.names
* df.empty -> len(spark_df.head(1)) == 0 (head(1) returns a plain list of Rows, which has no isEmpty attribute; checking its length avoids a full count)
Function calls:
* 1. get_series_descriptions(df, pbar)
* 2. [to-do] calculate_correlation(df, variables, correlation_name)
* 3. [to-do] get_scatter_matrix(df, variables)
* 4. [to-do] get_table_stats(df, variable_stats)
* 5. [to-do] get_missing_diagrams(df, table_stats)
* 6. [to-do] get_sample(df)
* 7. [to-do] get_duplicates(df, supported_columns)
Final Implementation Plan:
TO-DO
____
#### 1. def get_series_descriptions(df, pbar):
""" this function is mainly a wrapper around describe_1d
from _pandas_profiling.models.summary.py_"""
Pandas to Spark Changes (old -> new) :
* None
Function calls :
* 1.1 describe_1d
Final Implementation Plan :
No change
____
#### 1.1 def describe_1d(series: pd.Series) -> dict:
"""Describe a series (infer the variable type, then calculate type-specific values).
from pandas_profiling.models.summary.py
Args:
series: The Series to describe.
Returns:
A Series containing calculated series description values.
"""
Pandas to Spark Changes (old -> new) :
* df.columns -> spark_df.schema.names
* df.empty -> df.head(1).isEmpty
* series.fillna(np.nan) -> do not replace NaN as we have no way of operating on the original dataframe. We could save a copy to local if needed but not preferred
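A minimal sketch of that read-only alternative (assuming `spark_df` and a numeric column name `col`, both placeholders):

```python
import pyspark.sql.functions as F

def count_missing(spark_df, col: str) -> int:
    # Counts both SQL NULLs and float NaNs without touching the source data.
    return spark_df.filter(F.col(col).isNull() | F.isnan(col)).count()
```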
Function Calls:
* 1.1.1 [in prog] get_var_type(series)
* 1.1.2 [to-do] describe_supported(series, series_description)
* 1.1.3 [to-do] describe_unsupported(series, series_description)
* 1.1.4 [to-do] histogram_compute(finite_values, n_unique, name="histogram")
* 1.1.5 [to-do] numeric_stats_pandas(series: pd.Series)
* 1.1.6 [to-do] numeric_stats_numpy(present_values)
* 1.1.7 [to-do] describe_numeric_1d(series: pd.Series, series_description: dict)
* 1.1.8 [Proposed Phase 2] describe_date_1d(series: pd.Series, series_description: dict)
* 1.1.9 [Proposed Phase 2] describe_categorical_1d(series: pd.Series, series_description: dict)
* 1.1.10 [Proposed Phase 2] describe_url_1d(series: pd.Series, series_description: dict)
* 1.1.11 [Proposed Phase 2] describe_file_1d(series: pd.Series, series_description: dict)
* 1.1.12 [Proposed Phase 2] describe_path_1d(series: pd.Series, series_description: dict)
* 1.1.13 [Proposed Phase 2] describe_image_1d(series: pd.Series, series_description: dict)
* 1.1.14 [Proposed Phase 2] describe_boolean_1d(series: pd.Series, series_description: dict)
Final Implementation Plan:
TO-DO
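While the plan above is TO-DO, here is a minimal sketch of what the Phase 1 numeric branch (1.1.7, together with 1.1.4's histogram) could look like using read-only Spark operations. `describe_numeric_1d_spark` is a placeholder name, and approxQuantile trades exactness for a single distributed pass:

```python
import pyspark.sql.functions as F
from pyspark.sql import DataFrame

def describe_numeric_1d_spark(spark_df: DataFrame, col: str) -> dict:
    # Core moments and counts in one distributed aggregation.
    stats = spark_df.agg(
        F.count(col).alias("count"),
        F.mean(col).alias("mean"),
        F.stddev(col).alias("std"),
        F.min(col).alias("min"),
        F.max(col).alias("max"),
        F.skewness(col).alias("skewness"),
        F.kurtosis(col).alias("kurtosis"),
        F.countDistinct(col).alias("n_unique"),
    ).first().asDict()

    # Quantiles: approximate, but computed without collecting raw data.
    quantiles = spark_df.approxQuantile(col, [0.05, 0.25, 0.5, 0.75, 0.95], 0.01)
    stats.update(zip(["5%", "25%", "50%", "75%", "95%"], quantiles))

    # Histogram (1.1.4): drop nulls first, then bucket on the executors.
    edges, counts = (
        spark_df.select(col).dropna().rdd.map(lambda row: row[0]).histogram(10)
    )
    stats["histogram"] = (edges, counts)
    return stats
```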
____
#### 1.1.1 def get_var_type(series)
"""Get the variable type of a series.
from pandas_profiling.models.summary.py
Args:
series: Series for which we want to infer the variable type.
Returns:
The series updated with the variable type included.
"""
Function calls:
* 1.1.1.1 [in prog] get_counts(series)
* 1.1.1.2 [Proposed Phase 2] is_boolean(series, series_description)
* 1.1.1.3 is_numeric(series, series_description)
* 1.1.1.4 [Proposed Phase 2] is_date(series)
* 1.1.1.5 [Proposed Phase 2] is_url(series, series_description)
* 1.1.1.6 [Proposed Phase 2] is_path(series, series_description)
Final Implementation Plan:
TO-DO
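Although the plan is TO-DO, a minimal sketch of the Phase 1 direction (placeholder names; the string labels stand in for pandas-profiling's variable-type enum): unlike pandas, a Spark column carries an explicit type in the schema, so much of get_var_type can dispatch on the dtype directly, and get_counts (1.1.1.1) reduces to a distributed groupBy.

```python
import pyspark.sql.functions as F
from pyspark.sql import DataFrame
from pyspark.sql.types import BooleanType, DateType, NumericType, TimestampType

def get_var_type_spark(spark_df: DataFrame, col: str) -> str:
    # Dispatch on the declared schema type instead of sampling values.
    dtype = spark_df.schema[col].dataType
    if isinstance(dtype, NumericType):                # Phase 1
        return "NUM"
    if isinstance(dtype, BooleanType):                # Phase 2
        return "BOOL"
    if isinstance(dtype, (DateType, TimestampType)):  # Phase 2
        return "DATE"
    return "UNSUPPORTED"

def get_counts_spark(spark_df: DataFrame, col: str) -> DataFrame:
    # Value counts computed on the cluster; collect only when needed.
    return spark_df.groupBy(col).count().orderBy(F.desc("count"))
```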
____
Work in progress.