Quick fix for removing duplicates; need hands-on PySpark data stratification

I'm looking for someone with expertise in PySpark data stratification. I have pseudo code available, and I'm looking to remove duplicates from the post-stratification dataset.

Here is a sample of the data; I created a bin field based on agg_readings. The data is huge: close to 320 million records stored in Hive in Parquet format. Of the 320 million, I'm looking to get 5 million records via stratification. Below is the sample snippet.

I used sampleBy to fetch a stratified sample based on two columns (mnth_src_fld and bin). All I want from the stratified output is for gen_rnd_id to be unique across the entire dataset post-stratification, but unfortunately I'm not getting unique gen_rnd_ids. For instance, in the sample below the ids "1805040053" and "2352639960" each appear more than once. Based on one source about stratification on multiple columns, I came up with the code below, but I'm not sure how to make the ids unique across all bins (1, 2, and 3).

gen_rnd_id  mnth_src_fld     agg_total_readings  bin  insert_dt
1810269893  AUG- GNL DVCS    1421.36             2    9/22/2022
2360812758  AUG- GNL DVCS    1421.32             2    9/22/2022
2561885533  AUG- GNL DVCS    1421.23             1    9/22/2022
2360812759  AUG- GNL DVCS    1421.17             1    9/22/2022
1460501167  AUG- GNL DVCS    1421.11             2    9/22/2022
1515893360  AUG- GNL DVCS    1420.71             2    9/22/2022
1805040053  AUG- GNL DVCS    1419.87             2    9/22/2022
2436175429  AUG- GNL DVCS    1419.63             2    9/22/2022
2941769711  AUG- GNL DVCS    1419.38             2    9/22/2022
2352639960  AUG- GNL DVCS    1417.74             2    9/22/2022
2600364039  AUG- GNL DVCS    643.76              1    9/22/2022
2803486093  AUG- GNL DVCS    643.65              1    9/22/2022
2752468042  AUG- GNL DVCS    643.21              1    9/22/2022
2352639960  AUG-LIR          693797.58           3    9/22/2022
1805040053  AUG-PRO          361753.83           3    9/22/2022
2223875595  AUG-REFRIGERATE  319019.03           3    9/22/2022
2243916002  AUG-REFRIGERATE  230745.32           3    9/22/2022

import pyspark.sql.functions as F

def get_stratified_split_multiple_columns(input_df, col_name1, col_name2, seed_value, train_frac):
    merged_col_name = "both_labels"
    # Concatenate the two stratification columns into a single key
    input_df = input_df.withColumn(
        merged_col_name,
        F.concat(F.col(col_name1).cast("string"), F.lit('_#_@_#_'), F.col(col_name2).cast("string"))
    )
    print(f"DEBUG: Processing stratification with {col_name1} & {col_name2} combined fields, "
          f"seed_value={seed_value}, train_frac={train_frac}\n")
    # Build the sampleBy fractions dict: the same fraction for every stratum
    fractions1 = (input_df.select(merged_col_name).distinct()
                  .withColumn("fraction", F.lit(train_frac))
                  .rdd.collectAsMap())
    train_df = input_df.sampleBy(merged_col_name, fractions1, seed_value)
    # Drop the merged helper column
    train_df = train_df.drop(merged_col_name)
    return train_df

train_df = get_stratified_split_multiple_columns(load_df, 'mnth_src_fld', 'bin',
                                                 seed_value=10, train_frac=train_frac_val)

Skills: PySpark, Spark, Python

About the client:
( 0 reviews ) Flanders, United States

Project ID: #34732569

4 freelancers are bidding an average of $21 for this project


Hi, I've read your description carefully. I have full experience with Python and PySpark and have worked on several similar projects. ✅ I can complete your project with high quality, on time. ✅ We can discuss about bu More

USD in 1 day
(11 reviews)

I need to analyze the data and the requirement, but I can propose an md5 hash key to uniquely identify the rows.

USD in 2 days
(0 reviews)

I have 4 years of professional experience in PySpark and have done a lot of work like yours. I can do your project really fast; please feel free to contact me.

USD in 4 days
(0 reviews)

I have many years of experience in PySpark, so I can help you solve it. Please chat with me about your project.

USD in 1 day
(0 reviews)