How to iterate on a column in a PySpark dataframe based on unique records and non-NA values

By : Kumar Ujjawal
Date : October 18 2020, 06:10 PM
Hope this one helps. I think the problem arises from the last line. If I understand your problem correctly, this should be what you're looking for:
code :
 temp1 = sampdf[(sampdf['area'] == i) | (sampdf['area'] == "Unknown")]
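
This line would typically sit inside a loop over the unique, non-NA values. A minimal sketch of that surrounding loop, assuming sampdf is a pandas-style DataFrame with an 'area' column as in the snippet above:

code :
# iterate over the unique, non-NA area values
for i in sampdf['area'].dropna().unique():
    # rows for the current area value, plus the rows flagged "Unknown"
    temp1 = sampdf[(sampdf['area'] == i) | (sampdf['area'] == "Unknown")]
    # ...process temp1 here...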


Pyspark dataframe: Counting of unique values in a column, independently co-occurring with values in other columns

By : RedDevil
Date : March 29 2020, 07:55 AM
Something like the below fixes the issue. Here is one way to approach this problem. For each row, create two new columns:
  • Column 'RS': the set of sources for the 'Regulator'
  • Column 'TS': the set of sources for the 'Target'
code :
from pyspark.sql import Window
from pyspark.sql.types import IntegerType  # used by the UDF below
import pyspark.sql.functions as f
cols = ["Regulator", "Target", "Source"]
data = [
    ('m', 'A', 'x'),
    ('m', 'B', 'x'),
    ('m', 'C', 'z'),
    ('n', 'A', 'y'),
    ('n', 'C', 'x'),
    ('n', 'C', 'z')
]

df = sqlCtx.createDataFrame(data, cols)  # sqlCtx: a SQLContext (use spark.createDataFrame on Spark 2+)
df = df.withColumn(
    'RS',
    f.collect_set(f.col('Source')).over(Window.partitionBy('Regulator'))
)

df = df.withColumn(
    'TS',
    f.collect_set(f.col('Source')).over(Window.partitionBy('Target'))
)
df.sort('Regulator', 'Target', 'Source').show()
#+---------+------+------+---------+------+
#|Regulator|Target|Source|       RS|    TS|
#+---------+------+------+---------+------+
#|        m|     A|     x|   [z, x]|[y, x]|
#|        m|     B|     x|   [z, x]|   [x]|
#|        m|     C|     z|   [z, x]|[z, x]|
#|        n|     A|     y|[y, z, x]|[y, x]|
#|        n|     C|     x|[y, z, x]|[z, x]|
#|        n|     C|     z|[y, z, x]|[z, x]|
#+---------+------+------+---------+------+
intersection_length_udf = f.udf(lambda u, v: len(set(u) & set(v)), IntegerType())

df = df.withColumn('No_sources', intersection_length_udf(f.col('TS'), f.col('RS')))

df.select('Regulator', 'Target', 'Source', 'No_sources')\
    .sort('Regulator', 'Target', 'Source')\
    .show()
#+---------+------+------+----------+
#|Regulator|Target|Source|No_sources|
#+---------+------+------+----------+
#|        m|     A|     x|         1|
#|        m|     B|     x|         1|
#|        m|     C|     z|         2|
#|        n|     A|     y|         2|
#|        n|     C|     x|         2|
#|        n|     C|     z|         2|
#+---------+------+------+----------+
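On Spark 2.4+, the Python UDF can be replaced with built-in array functions, which avoids serialization overhead. A sketch assuming the same df as above:

code :
# size of the set intersection, computed without a Python UDF
df = df.withColumn('No_sources', f.size(f.array_intersect(f.col('TS'), f.col('RS'))))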
How to select records in one pyspark dataframe based on unique records in other or with value as Unknown

By : user2792051
Date : March 29 2020, 07:55 AM
This may help you. Try this:
I kept the join, filter, and union in separate dataframes for ease of explanation; they could be combined.
code :
from pyspark.sql import functions as psf

join_condition = [psf.col('a.area') == psf.col('b.area')]


# fix_map and con_melt are the question's input dataframes
df1 = fix_map.alias("a").join(con_melt.alias("b"), join_condition).select('a.id', 'a.area', 'a.type')

df2 = con_melt.filter("area == 'Unknown'").select('id','area','type')

df1.union(df2).show()

#+------+-------+----+
#|    id|   area|type|
#+------+-------+----+
#|227149|    510| mob|
#|122350|Unknown| fix|
#+------+-------+----+
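If no columns from con_melt are needed for the matching rows, a left-semi join expresses the same filter more directly. A sketch under that assumption:

code :
# keep fix_map rows whose area appears in con_melt, then append the Unknown rows
df1 = fix_map.join(con_melt, on='area', how='left_semi').select('id', 'area', 'type')
df2 = con_melt.filter("area == 'Unknown'").select('id', 'area', 'type')
df1.union(df2).show()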
Group column of pyspark dataframe by taking only unique values from two columns

By : user3142013
Date : March 29 2020, 07:55 AM
I hope this helps you. This looks like a connected-components problem. There are a couple of ways you can go about doing this:
1. GraphFrames (a sketch follows the code below)
2. Normalizing each pair with sort_array so that symmetric duplicates collapse, then grouping:
code :
import pyspark.sql.functions as F

# sort each (fruit, fruits) pair so that symmetric duplicates get the same key
hashed_df = df.withColumn('hash', F.sort_array(F.array(F.col('fruit'), F.col('fruits'))))
distinct_df = hashed_df.dropDuplicates(['hash'])
revert_df = distinct_df.withColumn('fruit', F.col('hash')[0]) \
    .withColumn('fruits', F.col('hash')[1])
grouped_df = revert_df.groupBy('fruit').agg(F.collect_list('fruits').alias('group'))
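
For the GraphFrames route named above, a minimal sketch; it assumes the graphframes package is installed, spark is the active SparkSession, and df is the same fruit/fruits dataframe:

code :
from graphframes import GraphFrame

# vertices: every distinct fruit name; edges: the observed (fruit, fruits) pairs
vertices = df.select(F.col('fruit').alias('id')) \
    .union(df.select(F.col('fruits').alias('id'))).distinct()
edges = df.select(F.col('fruit').alias('src'), F.col('fruits').alias('dst'))

spark.sparkContext.setCheckpointDir('/tmp/gf-checkpoints')  # required by connectedComponents
g = GraphFrame(vertices, edges)
components = g.connectedComponents()  # adds a 'component' id per connected group of fruits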
pyspark dataframe data transformation with unique column values

By : bcduggan
Date : March 29 2020, 07:55 AM
This might help you. The question states: "I am trying to learn pyspark with SQL functionality or by a dataframe group-by solution itself." You can use collect_set as:
code :
from pyspark.sql.functions import concat_ws, collect_set

df.groupBy("Name", "Place").agg(concat_ws(",", collect_set("Product")))
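
A minimal usage sketch with hypothetical sample data (the Name/Place/Product columns are assumed from the question):

code :
data = [("John", "NY", "TV"), ("John", "NY", "Radio"), ("John", "NY", "TV")]
df = spark.createDataFrame(data, ["Name", "Place", "Product"])

# collect_set drops duplicate products; concat_ws joins the set into one string
df.groupBy("Name", "Place") \
    .agg(concat_ws(",", collect_set("Product")).alias("Products")) \
    .show()
# -> one row: John | NY | "TV,Radio" (set order is not guaranteed)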
Pyspark - How to insert records in dataframe 1, based on a column value in dataframe2

By : user3657117
Date : March 29 2020, 07:55 AM
Wish this helps you. I would like to provide the code snippet, as it may be useful to some.
code :
df1 = sqlContext.createDataFrame(
    [("xxx1", "81A01", "TERR NAME 01"),
     ("xxx1", "81A01", "TERR NAME 02"),
     ("xxx1", "81A01", "TERR NAME 03")],
    ["zip_code", "zone_code", "territory_name"])
df2 = sqlContext.createDataFrame(
    [("xxx1", "", "", "NY"),
     ("xxx1", "", "TERR NAME 99", "NY")],
    ["zip_code", "zone_code", "territory_name", "state"])

df1.createOrReplaceTempView('df1')
df2.createOrReplaceTempView('df2')

spark.sql("select * from df1").show()
+--------+---------+--------------+ 
|zip_code|zone_code|territory_name| 
+--------+---------+--------------+ 
| xxx1   | 81A01   | TERR NAME 01 | 
| xxx1   | 81A01   | TERR NAME 02 | 
| xxx1   | 81A01   | TERR NAME 03 | 
+--------+---------+--------------+ 

spark.sql("select * from df2").show()
+--------+---------+--------------+-----+ 
|zip_code|zone_code|territory_name|state| 
+--------+---------+--------------+-----+ 
| xxx1   |         |              | NY  | 
| xxx1   |         | TERR NAME 99 | NY  | 
+--------+---------+--------------+-----+

spark.sql("""select a.zip_code, b.zone_code, b.territory_name, a.state from df2 a 
            left join df1 b 
            on a.zip_code = b.zip_code 
            where a.territory_name = ''
            UNION
            select a.zip_code, b.zone_code, a.territory_name, a.state from df2 a 
            left join df1 b 
            on a.zip_code = b.zip_code 
            where a.territory_name != ''
            """).createOrReplaceTempView('df3')


spark.sql("select * from df3").show()
+--------+---------+--------------+-----+ 
|zip_code|zone_code|territory_name|state| 
+--------+---------+--------------+-----+ 
| xxx1   | 81A01   | TERR NAME 03 | NY  | 
| xxx1   | 81A01   | TERR NAME 99 | NY  |  
| xxx1   | 81A01   | TERR NAME 01 | NY  | 
| xxx1   | 81A01   | TERR NAME 02 | NY  | 
+--------+---------+--------------+-----+
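
For readers who prefer the DataFrame API, a sketch of the same logic; it assumes df1 and df2 as created above, with .distinct() mirroring SQL UNION's deduplication:

code :
from pyspark.sql import functions as F

# blank territory names take zone/territory from df1; the rest keep their own
blank = df2.filter(F.col('territory_name') == '').alias('a') \
    .join(df1.alias('b'), 'zip_code') \
    .select('zip_code', 'b.zone_code', 'b.territory_name', 'a.state')
named = df2.filter(F.col('territory_name') != '').alias('a') \
    .join(df1.alias('b'), 'zip_code') \
    .select('zip_code', 'b.zone_code', 'a.territory_name', 'a.state')
df3 = blank.union(named).distinct()
df3.show()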