UDF function to check whether my input dataframe has duplicate columns or not using pyspark

By : s .rose
Date : September 17 2020, 09:00 AM
You don't need a UDF; a plain Python function is enough. The check runs in Python, not in the JVM. So, as @Santiago P said, you can use checkDuplicate on its own.
code :
    def checkDuplicate(df):
        # True when every column name is unique, i.e. there are no duplicate columns
        return len(set(df.columns)) == len(df.columns)
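
If you also want to know which names are repeated, the same idea extends naturally. The following is only a sketch: duplicate_column_names is a hypothetical helper, not part of PySpark, and a plain list stands in for df.columns so it runs without a Spark session.

```python
from collections import Counter

def duplicate_column_names(columns):
    """Return the column names that appear more than once, in first-seen order."""
    counts = Counter(columns)
    return [name for name in counts if counts[name] > 1]

# df.columns is a plain list of strings, so a list stands in for it here.
print(duplicate_column_names(["a", "b", "c", "a", "d", "e", "d"]))  # ['a', 'd']
```

With a real dataframe you would call it as duplicate_column_names(df.columns).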

PySpark Dataframe identify distinct value on one column based on duplicate values in other columns

By : user5858117
Date : March 29 2020, 07:55 AM
I have a PySpark dataframe where c1, c2, c3, c4, c5, c6 are the columns. This is actually very simple. Let's create some data first:
code :
schema = ['c1','c2','c3','c4','c5','c6']

rdd = sc.parallelize(["a,x,y,z,g,h","b,x,y,z,l,h","c,x,y,z,g,h","d,x,f,y,g,i","e,x,y,z,g,i"]) \
        .map(lambda x : x.split(","))

df = sqlContext.createDataFrame(rdd,schema)
# +---+---+---+---+---+---+
# | c1| c2| c3| c4| c5| c6|
# +---+---+---+---+---+---+
# |  a|  x|  y|  z|  g|  h|
# |  b|  x|  y|  z|  l|  h|
# |  c|  x|  y|  z|  g|  h|
# |  d|  x|  f|  y|  g|  i|
# |  e|  x|  y|  z|  g|  i|
# +---+---+---+---+---+---+
from pyspark.sql.functions import col, collect_list, count, explode

# collect the c1 values as a list and count them at the same time,
# then keep only the groups that occur more than once (the dupes)
dupes = df.groupBy('c2','c3','c4','c5') \
          .agg(collect_list('c1').alias("c1s"), count('c1').alias("count")) \
          .filter(col('count') > 1)

df2 = dupes.select(explode("c1s").alias("c1_dups"))

# +-------+
# |c1_dups|
# +-------+
# |      a|
# |      c|
# |      e|
# +-------+

Pyspark remove duplicate columns in a dataframe

By : zer0space
Date : March 29 2020, 07:55 AM
You can drop the duplicate columns by comparing all unique pairs of columns that could potentially be identical. You can use combinations from the itertools library to generate these pairs:
code :
from itertools import combinations
# select columns that can be identical; this can also be a hardcoded list
L = [c for c in df1.columns if 'TYPE' in c]
# we only want to do pairwise comparisons, so the second value of combinations is 2
permutations = list(combinations(L, 2))
columns_to_drop = set()
for permutation in permutations:
    # no row differs between the two columns, so the second one is a duplicate
    # (rows where either value is null are not counted as different)
    if df1.filter(df1[permutation[0]] != df1[permutation[1]]).count() == 0:
        columns_to_drop.add(permutation[1])
df1.select([c for c in df1.columns if c not in columns_to_drop]).show()
# |  1|   A|  X1|
# |  2|   B|  X2|
# |  3|   B|  X3|
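
To see the pairwise logic in isolation, here is a minimal plain-Python sketch that runs without Spark. The dictionary of columns, its column names, and the helper find_duplicate_columns are all hypothetical stand-ins for df1; only the comparison strategy mirrors the answer above.

```python
from itertools import combinations

def find_duplicate_columns(table):
    """Return names of columns whose values repeat an earlier column's values."""
    to_drop = set()
    for first, second in combinations(table, 2):
        # skip columns already marked, then compare the two value lists
        if first not in to_drop and table[first] == table[second]:
            to_drop.add(second)
    return to_drop

# hypothetical data: TYPE_B repeats TYPE_A, TYPE_C does not
table = {
    "TYPE_A": ["A", "B", "B"],
    "TYPE_B": ["A", "B", "B"],
    "TYPE_C": ["X1", "X2", "X3"],
}
print(find_duplicate_columns(table))  # {'TYPE_B'}
```

The number of comparisons grows quadratically with the number of candidate columns, and in Spark each comparison is a separate job, so it pays to narrow the candidate list first.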

Need to remove duplicate columns from a dataframe in pyspark

By : chugeluang
Date : March 29 2020, 07:55 AM
You might have to rename some of the duplicate columns in order to filter out the duplicates. Otherwise, every column listed in duplicatecols will be de-selected, while you probably want to keep one column for each name. Below is one way that might help:
code :
# an example dataframe
cols = list('abcaded')
df_ticket = spark.createDataFrame([tuple(i for i in range(len(cols)))], cols)
>>> df_ticket.show()
#|  a|  b|  c|  a|  d|  e|  d|
#|  0|  1|  2|  3|  4|  5|  6|

# unless you just want to filter a subset of all duplicate columns
# this list is probably not useful
duplicatecols = list('ad')

# create cols_new so that seen columns will have a suffix '_dup'
cols_new = []
seen = set()
for c in df_ticket.columns:
    cols_new.append('{}_dup'.format(c) if c in seen else c)
    seen.add(c)  # remember the name so later occurrences get the suffix

>>> cols_new
#['a', 'b', 'c', 'a_dup', 'd', 'e', 'd_dup']
>>> df_ticket.toDF(*cols_new).select(*[c for c in cols_new if not c.endswith('_dup')]).show()
#|  a|  b|  c|  d|  e|
#|  0|  1|  2|  4|  5|
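
If a name can occur more than twice, a single '_dup' suffix would itself produce repeated names (two 'a_dup' columns, for example). One possible variation, shown here as a sketch with a hypothetical helper rename_duplicates, numbers each repeat instead. It works on a plain list of names, so it can be tried without Spark.

```python
from collections import defaultdict

def rename_duplicates(columns, suffix="_dup"):
    """Keep the first occurrence of each name and number the later ones."""
    seen = defaultdict(int)  # how many times each name has appeared so far
    renamed = []
    for name in columns:
        if seen[name]:
            renamed.append("{}{}{}".format(name, suffix, seen[name]))
        else:
            renamed.append(name)
        seen[name] += 1
    return renamed

print(rename_duplicates(list("abcaa")))  # ['a', 'b', 'c', 'a_dup1', 'a_dup2']
```

The resulting list can be passed to toDF in the same way as cols_new above, and the suffixed columns can then be kept or dropped individually.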

Pyspark dataframe joins with few duplicated column names and few without duplicate columns

By : user3460342
Date : March 29 2020, 07:55 AM
There is no way to do it. Behind the scenes, an equi-join (colA == colB) whose condition is given as a string or a sequence of strings (which is called a natural join) is executed as if it were a regular equi-join (source), so
code :
            frame1.shared_column == frame2.shared_column, 

Pyspark Dataframe - How to concatenate columns based on array of columns as input

By : user3711791
Date : March 29 2020, 07:55 AM
You can unpack the columns using (*). In the pyspark.sql docs, any function whose signature takes (*cols) accepts an unpacked sequence of columns. For concat:
code :
from pyspark.sql import functions as F
arr = ["col1", "col2", "col3"]
newDF = rawDF.select(F.concat(*(F.col(col) for col in arr))).exceptAll(updateDF.select(F.concat(*(F.col(col) for col in arr))))
df3 = df2.join(df1, F.concat(*(F.col(col) for col in arr)) == df1.col5 )
