APACHE-SPARK QUESTIONS

Hive query to find the count for the weeks in middle
Although this answer is in Scala, the Python version will look almost the same and can easily be converted. Step 1: …
TAG : apache-spark
Date : October 21 2020, 06:10 PM , By : Mariano Lescano
How to count the null,na and nan values in each column of pyspark dataframe
The dataframe has NA, NaN and null values, with schema (Name: String, Rol.No: Integer, Dept: String). Use when() …
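A minimal PySpark sketch of the when()-based approach, assuming sample data and column names of my own (the question's exact schema is not shown); NaN is only checked on float/double columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data: a string column with None/"NA" and a double column with NaN/None.
df = spark.createDataFrame(
    [("amit", 1.0), (None, float("nan")), ("NA", None)],
    ["Name", "Score"],
)

def is_missing(c, dtype):
    cond = F.col(c).isNull() | (F.col(c) == "NA")      # nulls and literal "NA" strings
    if dtype in ("float", "double"):                   # NaN only exists for float/double columns
        cond = cond | F.isnan(c)
    return cond

counts = df.select([F.count(F.when(is_missing(c, t), c)).alias(c) for c, t in df.dtypes])
counts.show()
```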
TAG : apache-spark
Date : October 19 2020, 06:10 PM , By : chris greger
Elasticsearch Spark parse issue - cannot parse value [X] for field [Y]
I've created a sample document based on your data in ES 6.4/Spark 2.1 and used the code below to read the GenerateTime field as text instead of date type in Spark. Mapping in ES: …
TAG : apache-spark
Date : October 19 2020, 06:10 PM , By : Amit Bhatt
PySpark/Glue: When using a date column as a partition key, is it always converted into String?
This is a known behavior of Parquet. You can add the following line before reading the parquet file to avoid this behavior: …
TAG : apache-spark
Date : October 16 2020, 06:10 AM , By : nurul shaumi
Design: Can a Kafka producer be written as a Spark job?
Spark provides a Kafka connector through which you can connect to any Kafka topic available in your cluster. Once connected to the topic you can read or write data. Example code: …
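A sketch of a batch write to Kafka from a Spark job (requires the spark-sql-kafka package on the classpath); the broker address, topic name, and sample dataframe are placeholders of my own:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("kafka-producer-job").getOrCreate()

# Hypothetical source data; any dataframe with string key/value columns will do.
df = spark.createDataFrame([("k1", "hello"), ("k2", "world")], ["key", "value"])

(df.select(F.col("key").cast("string"), F.col("value").cast("string"))
   .write
   .format("kafka")
   .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker address
   .option("topic", "my_topic")                        # placeholder topic name
   .save())
```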
TAG : apache-spark
Date : October 14 2020, 02:00 AM , By : Lluís Formiga i Fana
Are the data nodes in an HDFS the same as the executor nodes in a spark cluster?
I always think about these concepts from a standalone perspective first, then from a cluster perspective. Considering a single machine (where you would also run Spark in local mode), DataNode and NameNode are just pieces of software to …
TAG : apache-spark
Date : October 13 2020, 07:00 PM , By : Kentoy
Optimal file size and parquet block size
Before talking about the parquet side of the equation, one thing to consider is how the data will be used after you save it to parquet. If it's going to be read/processed often, you may want to consider what the access patterns are …
TAG : apache-spark
Date : October 13 2020, 12:00 PM , By : Andy Rybchuk
how to correctly configure maxResultSize?
The following should do the trick. Also note that you have misspelled ("spark.executor.memories", "10g"); the correct configuration key is 'spark.executor.memory'.
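A minimal sketch of setting these configuration keys when building the session; the "4g"/"10g" values are placeholders to be sized for your cluster:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("max-result-size-example")
         .config("spark.driver.maxResultSize", "4g")   # raise the limit on results collected to the driver
         .config("spark.executor.memory", "10g")       # note: 'memory', not 'memories'
         .getOrCreate())
```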
TAG : apache-spark
Date : October 13 2020, 07:00 AM , By : Shivang Gupta
Spark SQL nested JSON error "no viable alternative at input "
It's because SQL column names are expected to start with a letter or certain other characters like _ or @, but not a digit. Let's consider this simple example: …
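A small illustration of the workaround: quote such column names with backticks so the SQL parser accepts them. The dataframe and column name "1stCol" are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical dataframe with a column name that starts with a digit.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["1stCol", "name"])
df.createOrReplaceTempView("t")

# Unquoted, "SELECT 1stCol FROM t" fails to parse; backticks make it a valid identifier.
spark.sql("SELECT `1stCol`, name FROM t").show()
```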
TAG : apache-spark
Date : October 12 2020, 02:00 PM , By : RichieRich
Cassandra partition size vs partitions count while processing a large part of the table
As you suspect, planning to have just 31 partitions is a really bad idea for performance. The primary problem is that the database cannot scale: with RF=3, there would be at most (under unlikely optimal conditions) 93 …
TAG : apache-spark
Date : October 12 2020, 02:00 PM , By : Shardul Samdurkar
java.lang.NoClassDefFoundError: org/apache/spark/sql/catalyst/plans/logical/AnalysisHelper while writing delta-lake into
What is your Spark version? org/apache/spark/sql/catalyst/plans/logical/AnalysisHelper was introduced in 2.4.0; if you are using an older version, you will have this issue. In 2.4.0: https://github.com/apache/spark/tree/v2.4.0/sql …
TAG : apache-spark
Date : October 12 2020, 11:00 AM , By : Chaim Paneth
Any clue how to join this spark-structured stream joins?
AFAIK Spark structured streaming can't do joins after aggregations (or other non-map-like operations): https://spark.apache.org/docs/2.4.3/structured-streaming-programming-guide.html#support-matrix-for-joins-in-streaming-queries
TAG : apache-spark
Date : October 12 2020, 11:00 AM , By : Vitaly Mush
What is a fast way to generate parquet data files with Spark for testing Hive/Presto/Drill/etc?
I guess the main goal is to generate data, not to write it in a certain format. Let's start with a very simple example. …
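A minimal sketch of generating synthetic data and writing it as parquet (the column names and the /tmp output path are placeholders, not taken from the original answer):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Generate 10 million rows of synthetic data and write them as parquet.
df = (spark.range(0, 10_000_000)
      .withColumn("category", (F.rand() * 10).cast("int"))
      .withColumn("amount", F.rand() * 100))

df.write.mode("overwrite").parquet("/tmp/generated_parquet")  # placeholder output path
```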
TAG : apache-spark
Date : October 12 2020, 03:00 AM , By : Cássio Henrique Cost
Changing bucket class(Regional/Multi Regional) in Google Cloud Storage connector in Spark
From the documentation: Cloud Dataproc staging bucket …
TAG : apache-spark
Date : October 11 2020, 10:00 PM , By : Johnny Madden
How are spark structured streaming consumers initiated and invoked while reading multi-partitioned kafka topics?
If Kafka has more than one partition, consumers can benefit by doing a certain task in parallel. In particular, spark-streaming can internally speed up a job by increasing the num-executors parameter. That is tied to …
TAG : apache-spark
Date : October 11 2020, 08:00 PM , By : Zain Hatim
How to handle small file problem in spark structured streaming?
We had a similar problem, too. After a lot of Googling, it seemed the generally accepted way was to write another job that every so often aggregates the many small files and writes them elsewhere in larger, consolidated files. This is …
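A sketch of such a compaction job (the answer's own code is not shown, so the input/output paths and the target file count here are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("small-file-compaction").getOrCreate()

# Read the many small files produced by the streaming job and rewrite them
# as a handful of larger files in a separate location.
small = spark.read.parquet("/data/streaming_output")                   # placeholder input path
small.coalesce(8).write.mode("overwrite").parquet("/data/compacted")   # placeholder output path
```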
TAG : apache-spark
Date : October 11 2020, 08:00 PM , By : Mosa Alhadeed
Killing spark streaming job when no activity
Use a NoSQL table like Cassandra or HBase to keep the counter. You cannot handle stream polling inside a loop. Implement the same logic using NoSQL or MariaDB and perform a graceful shutdown of your streaming job if no activity is …
TAG : apache-spark
Date : October 11 2020, 08:00 PM , By : Aswin
How to save spark dataframe to parquet without using INT96 format for timestamp columns?
Reading the Spark code I found the spark.sql.parquet.outputTimestampType property. …
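A minimal sketch of using that property to avoid INT96 timestamps when writing parquet (the sample dataframe and output path are placeholders):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Write timestamps as TIMESTAMP_MICROS instead of the default INT96.
spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

df = spark.createDataFrame([("a",)], ["id"]).withColumn("ts", F.current_timestamp())
df.write.mode("overwrite").parquet("/tmp/ts_parquet")  # placeholder path
```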
TAG : apache-spark
Date : October 11 2020, 10:00 AM , By : Genesearch
How to pass an external resource yml/property file while running a spark job on a cluster?
You will have to use --files <path to your file> in the spark-submit command to be able to pass any files. The syntax for that is …
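For illustration, a hedged sketch of reading a file shipped via --files from inside the job using SparkFiles; the file name app_config.yml is hypothetical:

```python
from pyspark import SparkFiles
from pyspark.sql import SparkSession

# Submitted with, e.g.:  spark-submit --files /local/path/app_config.yml my_job.py
spark = SparkSession.builder.getOrCreate()

# Files passed with --files are resolved through the SparkFiles root directory.
config_path = SparkFiles.get("app_config.yml")   # placeholder file name
with open(config_path) as f:
    config_text = f.read()
```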
TAG : apache-spark
Date : October 11 2020, 10:00 AM , By : James
SparkSQL Get all prefixes of a word
Technically it is possible, but I doubt it will perform any better than a simple flatMap (if performance is the reason to avoid flatMap): …
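As an illustration of the SQL-side route (not necessarily the answer's exact code), a sketch using the Spark 2.4+ higher-order functions sequence and transform; the "word" column and sample data are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("spark",), ("sql",)], ["word"])

# Build all prefixes of each word without a flatMap, then explode them into rows.
prefixes = df.select(
    "word",
    F.explode(
        F.expr("transform(sequence(1, length(word)), i -> substring(word, 1, i))")
    ).alias("prefix"),
)
prefixes.show()
```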
TAG : apache-spark
Date : October 11 2020, 09:00 AM , By : Weather Amber
Is it required to install spark on all the nodes of cluster
If you use YARN as the manager on a cluster with multiple nodes, you do not need to install Spark on each node. YARN will distribute the Spark binaries to the nodes when a job is submitted. https://spark.apache.org/docs/latest/running-on-yarn.html
TAG : apache-spark
Date : October 11 2020, 08:00 AM , By : Vinay G.Leeiyar
Spark policy for handling multiple watermarks
As far as I understand, you would like to know how multiple watermarks behave for join operations, right? If so, I did some digging into the implementation to find the answer. The multipleWatermarkPolicy configuration is used globally …
TAG : apache-spark
Date : October 11 2020, 04:00 AM , By : ber reb
How to set optimal config values - trigger time, maxOffsetsPerTrigger - for Spark Structured Streaming while reading mes
You can run the Spark structured streaming application in either fixed-interval micro-batches or continuous mode. Here are some of the options you can use for tuning streaming applications. Kafka configurations: …
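A sketch of where those two knobs go (requires the spark-sql-kafka package); the broker, topic, sink paths, and the 10000/1 minute values are placeholders to be tuned:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-stream-tuning").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")  # placeholder broker
          .option("subscribe", "events")                      # placeholder topic
          .option("maxOffsetsPerTrigger", 10000)              # cap records read per micro-batch
          .load())

query = (stream.writeStream
         .format("parquet")
         .option("path", "/data/events")                      # placeholder sink path
         .option("checkpointLocation", "/data/checkpoints/events")
         .trigger(processingTime="1 minute")                  # fixed-interval micro-batches
         .start())
```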
TAG : apache-spark
Date : October 11 2020, 02:00 AM , By : will liu
Use GCS staging directory for Spark jobs (on Dataproc)
First, it's important to realize that the staging directory is primarily used for staging artifacts for executors (mainly jars and other archives) rather than for storing intermediate data as a job executes. If you want …
TAG : apache-spark
Date : October 10 2020, 06:00 PM , By : Benjamin
Does Hive preserve file order when selecting data
Without ORDER BY the order is not guaranteed. Data is read in parallel by many processes (mappers); after splits are calculated, each process starts reading some piece of a file or a few files, depending on the calculated splits.
TAG : apache-spark
Date : October 10 2020, 06:00 PM , By : Faketales
Required executor memory is above the max threshold of this cluster
Executor memory is only the heap portion of the memory. You still have to run a JVM plus allocate the non-heap portion of memory inside a container and have that fit in YARN. Refer to the image from "How-to: Tune Your Apache Sp…"
TAG : apache-spark
Date : October 10 2020, 05:00 PM , By : keyur
Invalid status code '400' from .. error payload: "requirement failed: Session isn't active
Judging by the output, if your application is not finishing with a FAILED status, that sounds like a Livy timeout error: your application is likely taking longer than the defined timeout for a Livy session (which defaults to 1h), …
TAG : apache-spark
Date : October 10 2020, 04:00 PM , By : Alexndar Jenny
Find number of partitions computed per machine in Apache Spark
I am not sure about the Spark UI, but here is how you can achieve it programmatically: …
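One possible programmatic sketch (not necessarily the answer's exact code): emit the executor hostname once per partition and count the results:

```python
import socket

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rdd = spark.sparkContext.parallelize(range(1000), numSlices=20)

# Emit the hostname once per partition, then count how many
# partitions each machine processed.
per_host = rdd.mapPartitions(lambda _: [socket.gethostname()]).countByValue()
print(dict(per_host))
```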
TAG : apache-spark
Date : October 09 2020, 10:00 PM , By : devslashnull
Get field values from a structtype in pyspark dataframe
IIUC, you can loop over the values in df2.schema.fields and get the name and dataType: …
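A minimal sketch of that loop, assuming df2 is the dataframe from the question:

```python
from pyspark.sql.types import StructType

# Print every field's name and data type.
for field in df2.schema.fields:
    print(field.name, field.dataType)

# Or keep only the struct-typed columns.
struct_fields = [f.name for f in df2.schema.fields if isinstance(f.dataType, StructType)]
```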
TAG : apache-spark
Date : October 09 2020, 05:00 PM , By : deepak patel
How to perform Unit testing on Spark Structured Streaming?
tl;dr Use MemoryStream to add events and a memory sink for the output. The following code should help you get started: …
TAG : apache-spark
Date : October 09 2020, 08:00 AM , By : Reema
PySpark: ModuleNotFoundError: No module named 'app'
The error is very clear: the module 'app' is not available. Your Python code runs on the driver, but your udf runs on the executor's Python VM. When you call the udf, Spark serializes create_emi_amount to send it to the executors. So, somewhere in …
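One common fix (not necessarily the answer's own) is to ship the module to the executors; "app.zip" here is a hypothetical archive containing the 'app' package:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Make the missing module importable on the executors so the udf can use it.
spark.sparkContext.addPyFile("app.zip")   # placeholder archive name

# Alternatively, pass it at submit time:  spark-submit --py-files app.zip my_job.py
```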
TAG : apache-spark
Date : October 09 2020, 07:00 AM , By : Berkay Yaman
Spark reuse broadcast DF
OK, updating the question. Summarising: inside the same action, left_semi joins will reuse broadcasts while normal/left joins won't. Not sure whether this is related to the fact that Spark already knows the columns of that DF won't affect …
TAG : apache-spark
Date : October 09 2020, 03:00 AM , By : Steven Guillemet
is it possible in spark to read large s3 csv files in parallel?
Typically Spark files are saved in multiple parts, allowing each worker to read different files. Is there a similar solution when working on a single file? S3 provides the Select API that should allow this kind of behaviour. S3 S…
TAG : apache-spark
Date : October 09 2020, 03:00 AM , By : user6039496
Which open source framework is best for ETL Apache Airflow or Apache Beam?
Apache Airflow is not an ETL framework; it is a workflow scheduling and monitoring application which will schedule and monitor your ETL pipeline. Apache Beam is a unified model for defining data processing workflows. That means your E…
TAG : apache-spark
Date : October 08 2020, 11:00 PM , By : Augun
Spark SubQuery scan whole partition
If I were you, I'd prefer a different approach rather than an SQL query and a full table scan.
TAG : apache-spark
Date : October 08 2020, 04:00 PM , By : user6041094
Spark S3Guard - Skip listing S3
As of July 2019 it is still tagged as experimental; HADOOP-14936 lists the tasks there. The recent work has generally covered corner cases you aren't going to encounter on a daily basis, but which we know exist and can't ignore.
TAG : apache-spark
Date : October 08 2020, 01:00 PM , By : Bendol
How does count distinct work in Apache spark SQL
Count distinct works by hash-partitioning the data, then counting distinct elements per partition, and finally summing the counts. In general it is a heavy operation due to the full shuffle, and there is no silver bullet to …
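For illustration, a small sketch comparing the exact count with the HyperLogLog-based approximation, which is the usual way to trade accuracy for speed (the sample data is hypothetical):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "a"), (3, "b")], ["id", "user"])

# Exact distinct count (full shuffle) vs. the cheaper approximation.
df.agg(
    F.countDistinct("user").alias("exact"),
    F.approx_count_distinct("user", 0.05).alias("approx"),  # 5% relative error target
).show()
```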
TAG : apache-spark
Date : October 08 2020, 05:00 AM , By : Sean Hogan
Running Spark in Memory-constrained Settings
Good data processing fundamentals will get you very far. Avoid dragging around columns you don't need. Use memory-efficient types like int instead of strings. Avoid operations that require materializing large …
TAG : apache-spark
Date : October 08 2020, 12:00 AM , By : Anju
bluedata pyspark hdfs write access problem: hdfs_access_control_exception: permission denied
With help from the BlueData support folks I could solve this problem. I got the information: "If ACL rules are not being applied, then it is possible the property dfs.namenode.acls.enabled is not set to true. Please change it to en…"
TAG : apache-spark
Date : October 07 2020, 05:00 PM , By : Amit
How do we save a huge pyspark dataframe?
I have a big pyspark dataframe which I want to save in myfile (.tsv) for further use. To do that I defined the following code. I'd suggest using the Spark native write functionality: …
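A minimal sketch of the native writer producing tab-separated output; the sample dataframe and output directory are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for the large dataframe from the question.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])

(df.write
   .option("sep", "\t")          # tab separator -> .tsv-style output
   .option("header", True)
   .mode("overwrite")
   .csv("/data/myfile_tsv"))     # placeholder output directory (Spark writes part files)
```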
TAG : apache-spark
Date : October 07 2020, 03:00 PM , By : s3vster
Spark: How to display the log4j loggers output to the console, when running in Cluster mode?
You can't print to the console in cluster mode because the driver will likely never be on the same node from which the application is launched. You will have to check the logs in the YARN/resource manager history.
TAG : apache-spark
Date : October 07 2020, 09:00 AM , By : Adom Frank
How to resolve this type of error "show is not a member "
org.apache.spark.rdd.RDD[(Int, String)] doesn't have a show() method. You have to convert it to a dataframe, as …
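The original answer is for the Scala API; for reference, a hedged PySpark sketch of the same idea (convert the RDD to a DataFrame before calling show), with made-up data and column names:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

rdd = spark.sparkContext.parallelize([(1, "a"), (2, "b")])  # an RDD of (int, str) pairs

# RDDs have no show(); convert to a DataFrame first.
df = rdd.toDF(["id", "name"])          # or spark.createDataFrame(rdd, ["id", "name"])
df.show()
```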
TAG : apache-spark
Date : October 06 2020, 11:00 PM , By : Danielle Soulard
How do I add Hive support in Apache Spark?
What environment are you running Spark in? The easy answer is to let whatever packaging tool is available do all the heavy lifting. For example, if you're on OS X use brew to install everything. If you're in a maven/sbt project, bring …
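Once the Hive dependencies are on the classpath, Hive support is turned on when the session is built; a minimal sketch:

```python
from pyspark.sql import SparkSession

# Hive support has to be enabled on the builder (and the spark-hive
# dependency must be available on the classpath).
spark = (SparkSession.builder
         .appName("hive-example")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("SHOW DATABASES").show()
```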
TAG : apache-spark
Date : October 06 2020, 08:00 PM , By : Derek Chin
EMR does not detect all the memory
From the RM screen, click on every node's HTTP Address link to go to each Node Manager's Web UI. There, click on Tools > Configuration and find the yarn.nodemanager.resource.memory-mb setting. This should indicate how much memory is allocated …
TAG : apache-spark
Date : October 06 2020, 12:00 PM , By : Xander Riga
Spark: cast bytearray to bigint
Use pyspark.sql.functions.hex and pyspark.sql.functions.conv: …
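A minimal sketch of that conversion; the sample binary column is hypothetical and assumes big-endian byte order:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical dataframe with a binary column holding big-endian integers.
df = spark.createDataFrame([(bytearray(b"\x00\x00\x01\x00"),)], ["raw"])

df = df.withColumn(
    "as_bigint",
    F.conv(F.hex(F.col("raw")), 16, 10).cast("bigint"),  # bytes -> hex string -> base-10 -> bigint
)
df.show()
```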
TAG : apache-spark
Date : October 06 2020, 06:00 AM , By : Srinivas
Load Parquet file into HDFS table-Pyspark
Did you try loading the data into the hdfs://hadoop_data/path/mx_test/ directory (since the table points to this directory), then checking that you can see the data in the Hive table?
TAG : apache-spark
Date : October 06 2020, 02:00 AM , By : Alexander Gustafsson
Difference between Caching mechanism in Spark SQL
In Spark SQL there is a difference in caching depending on whether you use SQL directly or the DataFrame DSL. Using the DSL, caching is lazy, so after calling …
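A small sketch contrasting the two, under the usual semantics (DSL cache is lazy, SQL CACHE TABLE is eager unless LAZY is specified); the dataframe and view name are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.range(0, 1000)
df.createOrReplaceTempView("numbers")

# DSL caching is lazy: nothing is materialized until an action runs.
df.cache()
df.count()                          # first action populates the cache

# SQL caching is eager by default: the scan happens as part of this statement.
spark.sql("CACHE TABLE numbers")
# (Use "CACHE LAZY TABLE numbers" for the lazy behaviour.)
```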
TAG : apache-spark
Date : October 05 2020, 11:00 PM , By : Vaanish Bisht
Cross account GCS access using Spark on Dataproc
To achieve this you need to re-configure the GCS and BQ connectors to use different service accounts for authentication; by default both of them use the GCE VM service account. To do so, please refer to Method 2 in the GCS connec…
TAG : apache-spark
Date : October 05 2020, 06:00 PM , By : Karthik Natarajan
Huge Multiline Json file is being processed by single Executor
As a rule of thumb, Spark's parallelism is based on the number of input files. But you specified only 1 file (MULTILINE_JSONFILE_.json), so Spark will use 1 CPU for processing the following code …
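A multiline JSON file is not splittable, so a common mitigation (not necessarily this answer's exact code) is to repartition right after the read; the path and partition count below are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The whole file lands in one partition on read; spread the work afterwards.
df = (spark.read
      .option("multiLine", True)
      .json("/data/MULTILINE_JSONFILE_.json"))

df = df.repartition(64)    # placeholder partition count, tune to your cluster
print(df.rdd.getNumPartitions())
```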
TAG : apache-spark
Date : October 05 2020, 04:00 AM , By : user6052816
How to use DataBricks dbutils jar outside notebook?
The Databricks dbutils library needs to be used in Eclipse or another IDE. Methods like dbutils.secrets.get are not available from the SecretUtil API outside a notebook. In this scenario we can use the com.databricks jar. This is the Maven r…
TAG : apache-spark
Date : October 05 2020, 02:00 AM , By : Renat Gabdulhakov
GCP Dataproc : CPUs and Memory for Spark Job
Usually you don't: the resources of a Dataproc cluster are managed by YARN, and Spark jobs are automatically configured to make use of them. In particular, Spark dynamic allocation is enabled by default. But your application code still mat…
TAG : apache-spark
Date : October 04 2020, 08:00 PM , By : DevBoss
How does Structured Streaming ensure exactly-once writing semantics for file sinks?
I assume that the question is about Micro-Batch Stream Processing (not Continuous Stream Processing).
TAG : apache-spark
Date : October 04 2020, 02:00 AM , By : حلا شندية
Underlying implementation of Group By clause in Spark SQL
In Spark SQL, if you call groupBy(key).agg(...) with some aggregation function inside agg, the typical physical plan is HashAggregate -> Exchange -> HashAggregate. The first HashAggregate is responsible for doing partial aggregation (…
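You can see this plan yourself with explain(); a minimal sketch with made-up data:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])

# The physical plan shows partial aggregation, a shuffle (Exchange), then final aggregation.
df.groupBy("key").agg(F.sum("value").alias("total")).explain()
```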
TAG : apache-spark
Date : October 04 2020, 12:00 AM , By : Tobi Fischer
Batch Analysis on HDFS
Some of these answers are subjective; you can decide what suits your needs best. These are just my observations, or techniques I have used in the past.
TAG : apache-spark
Date : October 03 2020, 09:00 PM , By : Hannah Dixon
Spark createDataFrame(df.rdd, df.schema) vs checkPoint for breaking lineage
Let me start by creating a dataframe with the line below: …
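For context, a hedged sketch of the two lineage-breaking options the title compares; the dataframe and checkpoint directory are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1000)

# Option 1: rebuild the dataframe from its RDD, which truncates the logical plan.
df_cut = spark.createDataFrame(df.rdd, df.schema)

# Option 2: checkpoint, which writes the data to a reliable location and also breaks lineage.
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")  # placeholder directory
df_ckpt = df.checkpoint()
```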
TAG : apache-spark
Date : October 03 2020, 07:00 PM , By : David Thomas
Structured streaming multiple watermarks
From the "Policy for handling multiple watermarks" section of the documentation: …
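For reference, a minimal sketch of switching the policy via configuration (the default is "min"):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Default "min": the global watermark is the minimum of the per-stream watermarks.
# "max" advances the global watermark with the fastest stream instead.
spark.conf.set("spark.sql.streaming.multipleWatermarkPolicy", "max")
```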
TAG : apache-spark
Date : October 03 2020, 04:00 PM , By : usarneme
left join on a key if there is no match then join on a different right key to get value
I have two Spark dataframes, say df_core and df_dict. It seems that in a left_outer operation: …
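One way to express the fallback (a sketch under assumed schemas, since the question's columns are not shown): left-join on each candidate key and coalesce the results.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical schemas: df_core has two candidate keys, df_dict maps key -> value.
df_core = spark.createDataFrame([("a", "x"), ("b", "y")], ["key1", "key2"])
df_dict = spark.createDataFrame([("a", 1), ("y", 2)], ["key", "value"])

d1 = df_dict.select(F.col("key").alias("k1"), F.col("value").alias("v1"))
d2 = df_dict.select(F.col("key").alias("k2"), F.col("value").alias("v2"))

result = (df_core
          .join(d1, df_core["key1"] == d1["k1"], "left_outer")   # try the first key
          .join(d2, df_core["key2"] == d2["k2"], "left_outer")   # and the second
          .withColumn("value", F.coalesce("v1", "v2"))           # keep whichever matched
          .drop("k1", "v1", "k2", "v2"))
result.show()
```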
TAG : apache-spark
Date : October 03 2020, 10:00 AM , By : FlavourAds
How to sort data in a cell of a dataframe?
Is there a way to sort data inside a cell of a dataframe? For example, I have a dataframe which contains two columns, colA and colB, with data as follows. You could achieve it by using a udf: …
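A sketch of the udf approach with made-up array data (the question's actual cell contents are not shown):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: colB holds an unsorted array in each cell.
df = spark.createDataFrame([("a", [3, 1, 2]), ("b", [9, 7])], ["colA", "colB"])

sort_cell = F.udf(lambda xs: sorted(xs) if xs is not None else None, ArrayType(IntegerType()))
df.withColumn("colB_sorted", sort_cell("colB")).show()

# The built-in sort_array avoids the udf overhead when the cell is already an array column.
df.withColumn("colB_sorted", F.sort_array("colB")).show()
```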
TAG : apache-spark
Date : October 02 2020, 08:00 PM , By : derJake
AnalysisException: Table or view not found --- Even after I create a view using "createGlobalTempView" , how t
These global views live in the database named global_temp, so I would recommend referencing the tables in your queries as global_temp.table_name. I am not sure if it solves your problem, but you can try it. From the …
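A minimal sketch of that qualification; the view name and sample data are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a")], ["id", "name"])
df.createGlobalTempView("my_view")

# Global temp views live in the 'global_temp' database, so qualify the name.
spark.sql("SELECT * FROM global_temp.my_view").show()
```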
TAG : apache-spark
Date : October 02 2020, 04:00 PM , By : Madhavan K R
Cassandra count query throwing ReadFailureException
That error stack is a read timeout to the nodes. This could actually be due to a number of reasons. Rather than answering this particular error, I'm going to answer in the context of what your end goal is here. You are trying to count …
TAG : apache-spark
Date : October 02 2020, 09:00 AM , By : Outlook REST
