A few days ago I conducted a little experiment on Spark's RDD operations. One of them was the foreach operation (which is an action). Simply put, this operation applies a user-specified function to each row in the RDD. Here's a simple example:
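The snippet below is a minimal sketch of that kind of experiment, not the original code; it assumes a running SparkContext named sc and uses an accumulator only to make the side effect observable on the driver.

```python
# Minimal foreach sketch (assumes an existing SparkContext `sc`).
rdd = sc.parallelize([1, 2, 3, 4, 5])

# foreach is an action: it applies the function to every element on the executors
# and returns nothing to the driver, so an accumulator is used to observe the effect.
total = sc.accumulator(0)

def visit(row):
    total.add(row)  # runs on the executors

rdd.foreach(visit)
print(total.value)  # 15
```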
According to the code base, the driver status tracking feature is only implemented for the standalone cluster manager. However, based on this reference, we could also poll the driver status for Mesos and Kubernetes (in cluster deploy mode). Additionally, such a feature is also possible for YARN.
A few days ago I did a small experiment with Airflow. To be precise, I scheduled Airflow to run a Spark job via spark-submit to a standalone cluster. I actually mentioned briefly how to create a DAG and Operators in the previous post.
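As a rough illustration (not the exact DAG from the experiment), a minimal DAG that triggers spark-submit against a standalone master could look like the sketch below; the master URL, file path, schedule, and task names are placeholders, and the BashOperator import path varies across Airflow versions.

```python
# Hypothetical sketch: schedule a spark-submit to a standalone cluster with Airflow.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # airflow.operators.bash on newer versions

with DAG(
    dag_id="spark_submit_example",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    submit_spark_job = BashOperator(
        task_id="submit_spark_job",
        bash_command=(
            "spark-submit "
            "--master spark://spark-master:7077 "  # placeholder master URL
            "--deploy-mode client "
            "/path/to/job.py"                      # placeholder job file
        ),
    )
```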
Airflow is basically a workflow management system. When we're talking about a "workflow", we're referring to a sequence of tasks that need to be performed to accomplish a certain goal. A simple example would be an ordinary ETL job: fetching data from data sources, transforming the data into the required formats, and then storing the transformed data in a data warehouse.
Weight of evidence (WoE) and information value (IV) are used as a framework for attribute relevance analysis. WoE and IV can be utilised independently since each of them plays a different role.
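For context, here is a rough pandas sketch of how WoE and IV are commonly computed for one binned attribute against a binary target; the data, column names, and the ln(% non-events / % events) convention are my own choices for illustration.

```python
import numpy as np
import pandas as pd

# Made-up example: one binned attribute and a binary target (1 = event).
df = pd.DataFrame({
    "bucket": ["A", "A", "B", "B", "B", "C", "C", "C", "C", "C"],
    "target": [0, 1, 0, 0, 1, 0, 0, 0, 1, 1],
})

grouped = df.groupby("bucket")["target"].agg(["sum", "count"])
grouped.columns = ["events", "total"]
grouped["non_events"] = grouped["total"] - grouped["events"]

# Share of events / non-events captured by each bucket.
grouped["pct_events"] = grouped["events"] / grouped["events"].sum()
grouped["pct_non_events"] = grouped["non_events"] / grouped["non_events"].sum()

# WoE per bucket, and IV as the sum of the per-bucket contributions.
grouped["woe"] = np.log(grouped["pct_non_events"] / grouped["pct_events"])
grouped["iv"] = (grouped["pct_non_events"] - grouped["pct_events"]) * grouped["woe"]

print(grouped[["woe", "iv"]])
print("IV:", grouped["iv"].sum())
```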
Kerberos is simply a "ticket-based" authentication protocol. It enhances the security approach used by password-based authentication protocols. Since there's a possibility that eavesdroppers could capture the password, Kerberos mitigates this by leveraging a ticket (how it is generated is explained below) that ideally should only be known by the client and the service.
Have you ever heard of the imblearn package? Based on its name, I think people who are familiar with machine learning will presume that it's a package specifically created for tackling the problem of imbalanced data. If you dig a little deeper, you'll find its GitHub repository here. And yes, once again, it's a Python package for playing with imbalanced data.
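For instance, random under-sampling with imblearn looks roughly like this; the toy dataset is made up, and older imblearn versions expose fit_sample instead of fit_resample.

```python
# Small, hypothetical example of rebalancing a dataset with imblearn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("before:", Counter(y))

sampler = RandomUnderSampler(random_state=42)
X_res, y_res = sampler.fit_resample(X, y)  # fit_sample on older imblearn versions
print("after:", Counter(y_res))
```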
I came across a research paper related to balanced random forest (BRF) for imbalanced data. For the sake of clarity, the following is the BRF algorithm taken from the paper:
A few days back I tried to submit a Spark job to a Livy server deployed in local mode. The procedure was straightforward since the only thing to do was to specify the job file along with the configuration parameters (like what we do when using spark-submit directly).
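Roughly, submitting such a batch job through Livy's REST API looks like the sketch below; the host, file path, and configuration values are placeholders.

```python
# Hypothetical sketch: submit a Spark job as a Livy batch via the REST API.
import json

import requests

LIVY_URL = "http://localhost:8998"  # placeholder Livy endpoint

payload = {
    "file": "/path/to/job.py",          # job file visible to the Livy server
    "conf": {
        "spark.executor.memory": "1g",  # placeholder configuration parameters
        "spark.executor.cores": "1",
    },
}

resp = requests.post(
    LIVY_URL + "/batches",
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
)
batch = resp.json()
print(batch["id"], batch["state"])
# The batch can then be polled via GET /batches/<id>/state until it finishes.
```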
One of the techniques in hyperparameter tuning is called Bayesian Optimization. It selects the next hyperparameter to evaluate based on the previous trials.
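As a quick illustration of the idea, here's a toy run with scikit-optimize's gp_minimize; the objective function and search space are made-up stand-ins for a real cross-validation loss and hyperparameter range.

```python
# Hypothetical illustration: Bayesian optimization of a toy objective with scikit-optimize.
from skopt import gp_minimize

def objective(params):
    x = params[0]
    return (x - 2.0) ** 2  # stand-in for a cross-validation loss

result = gp_minimize(
    objective,
    dimensions=[(-10.0, 10.0)],  # search space of the single "hyperparameter"
    n_calls=20,                  # each new point is chosen based on the previous trials
    random_state=42,
)
print(result.x, result.fun)
```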
In the previous post, I wrote about how to perform a pandas groupBy operation on a large dataset in a streaming way. The main problem being addressed is memory consumption, since the data size might be extremely large.
I came across an article about how to perform a groupBy operation on a large dataset. Long story short, the author proposes an approach called streaming groupBy, where the dataset is divided into chunks and the groupBy operation is applied to each chunk. This approach is implemented with pandas.
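The gist of the approach, as I understand it, is something like the following sketch; the file path, column names, and aggregation are illustrative.

```python
# Rough sketch of "streaming" groupBy with pandas: aggregate chunk by chunk,
# then combine the partial results. File path and column names are made up.
import pandas as pd

partials = []
for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
    partials.append(chunk.groupby("key")["value"].sum())

# Summing the per-chunk partial sums gives the exact overall result.
result = pd.concat(partials).groupby(level=0).sum()
print(result.head())
```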
H2O provides a platform for building machine learning models in a scalable way. By focusing on scalability, it leverages the concept of cluster computing and therefore enables engineers to perform big data analytics.
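A minimal sketch of what using H2O from Python looks like; the file path, target column, and model choice are placeholders rather than a recommendation.

```python
# Hypothetical sketch: connect to an H2O cluster and train a model on an H2OFrame.
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()  # starts a local H2O instance or connects to an existing cluster

# An H2OFrame lives on the H2O cluster, not inside the Python process.
frame = h2o.import_file("/path/to/data.csv")  # placeholder path

predictors = [c for c in frame.columns if c != "label"]  # "label" is a placeholder target
model = H2OGradientBoostingEstimator(ntrees=50)
model.train(x=predictors, y="label", training_frame=frame)
print(model)
```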
In BigQuery, an external data source is a data source that we can query directly even though the data is not stored in BigQuery's storage. We can query the data source just by creating an external table that refers to it, instead of loading the data into BigQuery.
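As a rough, unverified sketch of what that looks like with the google-cloud-bigquery Python client (project, dataset, table, and bucket names are placeholders):

```python
# Rough sketch: define an external table over CSV files in Cloud Storage.
# Project, dataset, table, and URI names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/data/*.csv"]
external_config.autodetect = True  # let BigQuery infer the schema

table = bigquery.Table("my-project.my_dataset.my_external_table")
table.external_data_configuration = external_config
client.create_table(table)

# The table can now be queried directly while the data stays in Cloud Storage.
```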
In the earlier section we learnt a bit about the buffer overflow technique. The primary concept is flooding the stack frame with input exceeding the buffer limit so that we can manipulate any data saved on the stack frame. Some things that can be done using this technique are changing the return address so that the attacker can call any function they want, changing the contents of variables so that the function executes the corresponding code, or changing the return value of a function.
I came across an odd use case when applying F.col() to certain dataframe operations in PySpark v2.4.0. Please note that the context of this issue is Oct 6, 2019. Such an issue might have been solved since then.
In the previous post, we discussed the implementation of the Kalman filter for a static state (the true values of the object's states are constant over time). In addition, the Kalman filter algorithm was applied to estimate a single true value.
The Kalman filter is an iterative mathematical process applied to consecutive data inputs to quickly estimate the true value (position, velocity, weight, temperature, etc.) of the object being measured, when the measured values contain random error or uncertainty.
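To make the idea concrete, here's a tiny, hand-rolled sketch of the simplest case: estimating a single constant value from noisy measurements. The measurement values and uncertainties are made up.

```python
# Minimal 1-D Kalman filter sketch for a static true value (made-up measurements).
measurements = [72.0, 75.0, 71.0, 74.0, 73.0]  # noisy readings of a constant quantity

estimate = 68.0        # initial guess
error_estimate = 2.0   # uncertainty of the current estimate
error_measure = 4.0    # assumed measurement uncertainty

for z in measurements:
    # Kalman gain: how much to trust the new measurement vs. the current estimate.
    gain = error_estimate / (error_estimate + error_measure)
    # Update the estimate and shrink its uncertainty.
    estimate = estimate + gain * (z - estimate)
    error_estimate = (1.0 - gain) * error_estimate
    print(f"measurement={z:.1f} gain={gain:.3f} estimate={estimate:.3f}")
```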
In the previous post I mentioned a simple way of estimating the density ratio of two probability distributions. I decided to create a Python package that provides such functionality.
In the previous post I shared how to detect covariate shift with a simple, model-based approach. After learning that the data distribution has changed, what can we do to address such an issue?
Covariate shift happens when the distribution of the train data differs from the distribution of the test data. Take a look at the following probability equation.
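Presumably the equation being referred to is the usual way of stating the assumption: the conditional distribution of the target stays the same while the input distribution changes.

```latex
% Covariate shift: the input distribution changes, the conditional does not.
P_{\text{train}}(y \mid x) = P_{\text{test}}(y \mid x),
\qquad
P_{\text{train}}(x) \neq P_{\text{test}}(x)
```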
I encountered an issue when applying the crosstab function in PySpark to pretty big data. And I think this should be considered a pretty big issue. Please note that the context of this issue is Sep 20, 2019. Such an issue might have been solved since then.
Code profiling is simply used to assess code performance, including that of its functions and the sub-functions within them. One of its obvious uses is code optimisation, where a developer wants to improve the code's efficiency by searching for bottlenecks.
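In Python, the quickest way to get such a breakdown is the built-in cProfile module; a tiny illustration with a made-up function:

```python
# Minimal illustration of code profiling with Python's built-in cProfile.
import cProfile
import pstats

def slow_function():
    total = 0
    for _ in range(100_000):
        total += sum(range(100))
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_function()
profiler.disable()

# Show the 10 most expensive entries, sorted by cumulative time.
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(10)
```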
A few days back I was exploring a big data quality tool called Griffin. There are lots of DQ tools out there, such as Deequ, Target's data validator, TensorFlow Data Validation, PySpark Owl, and Great Expectations. There's another one called Cerberus; however, it doesn't natively support large-scale data.
In the previous post, I mentioned that the general formula of the H statistic is the following (source: Wikipedia, Kruskal–Wallis one-way analysis of variance):
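For reference, the general form given on that Wikipedia page is, as far as I recall, the rank-based ratio below, where g is the number of groups, n_i the size of group i, r_ij the overall rank of observation j in group i, N the total number of observations, and the bars denote the group mean rank and the overall mean rank (N + 1)/2.

```latex
H = (N - 1)\,
    \frac{\sum_{i=1}^{g} n_i \left(\bar{r}_{i\cdot} - \bar{r}\right)^2}
         {\sum_{i=1}^{g} \sum_{j=1}^{n_i} \left(r_{ij} - \bar{r}\right)^2}
```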
The Kruskal-Wallis test is a non-parametric statistical test that is used to evaluate whether the medians of two or more groups are different. Since the test is non-parametric, it doesn’t assume that the data comes from a particular distribution.
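In practice the test is a one-liner with SciPy; the three sample groups below are made up.

```python
# Kruskal-Wallis test on three made-up groups using SciPy.
from scipy.stats import kruskal

group_a = [12.1, 14.3, 11.8, 13.0, 12.9]
group_b = [15.2, 16.1, 14.8, 15.9, 16.4]
group_c = [12.5, 13.1, 12.2, 13.8, 12.7]

statistic, p_value = kruskal(group_a, group_b, group_c)
print(f"H = {statistic:.3f}, p = {p_value:.4f}")
```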
A few days ago I did a little exploration on Spark’s groupBy behavior. Precisely, I wanted to see whether the order of the data was still preserved when applying groupBy on a repartitioned dataframe.
Recently I was exploring ways of adding a unique row ID column to a dataframe. The requirement is simple: "the row ID should strictly increase with a difference of one, and the data order must not be modified".
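One way that satisfies this requirement (monotonically_increasing_id alone doesn't, since its IDs aren't consecutive) is to go through zipWithIndex; a rough sketch, assuming an existing dataframe df and an active SparkSession:

```python
# Sketch: add a consecutive row ID without changing the data order,
# by round-tripping through the underlying RDD. Assumes an existing DataFrame `df`.
def with_row_id(df, id_col="row_id"):
    new_columns = [id_col] + df.columns
    return (
        df.rdd
          .zipWithIndex()                                 # (Row, index) pairs, 0-based
          .map(lambda pair: (pair[1],) + tuple(pair[0]))  # put the index first
          .toDF(new_columns)
    )

df_with_id = with_row_id(df)
df_with_id.show()
```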
A few days ago I came across a case where I needed to define a dataframe's column name with a special character, namely a dot ('.'). Take a look at the following schema example.
If you read my previous article titled Union Operation After Left-anti Join Might Result in Inconsistent Attributes Data, it was shown that the attributes data could be inconsistent when combining two dataframes after an inner join. According to the article, the solution is really simple: we just need to reorder the attributes using the select command. Here's a simple example.
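The illustration below uses made-up dataframes and column names; the point is simply to make the column order explicit before the union, since union() matches columns by position. An active SparkSession named spark is assumed.

```python
# Illustration: align the column order of both dataframes before union.
df_a = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "label"])
df_b = spark.createDataFrame([("z", 3)], ["label", "id"])  # same columns, different order

# Reorder df_b's columns to follow df_a's column order, then union.
aligned_b = df_b.select(df_a.columns)
result = df_a.union(aligned_b)
result.show()
```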
Unioning two dataframes after joining them with left_anti? Well, it seems like a straightforward approach. However, I recently encountered a case where the join operation might shift the position of the join key in the resulting dataframe. This, unfortunately, makes the result of merging the dataframes inconsistent in terms of the data in each attribute.
I encountered an intriguing result when joining a dataframe with itself (self-join). As you might already know, one of the problems that occurs when doing a self-join relates to duplicated column names. Because of this duplication, there's ambiguity when we perform operations that require us to provide the column names.
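A common way around the ambiguity is to alias each side of the self-join and qualify the column references; the employee/manager data below is made up, and an active SparkSession is assumed.

```python
# Made-up example: self-join with aliases to avoid ambiguous column references.
from pyspark.sql import functions as F

employees = spark.createDataFrame(
    [(1, "Ana", 2), (2, "Ben", 3), (3, "Cal", 2)],
    ["id", "name", "manager_id"],
)

emp = employees.alias("emp")
mgr = employees.alias("mgr")

report_lines = (
    emp.join(mgr, F.col("emp.manager_id") == F.col("mgr.id"), "inner")
       .select(
           F.col("emp.name").alias("employee"),
           F.col("mgr.name").alias("manager"),
       )
)
report_lines.show()
```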
Have you ever wondered how to find out the size of a dataframe? Perhaps it doesn't sound like a fancy thing to know, yet I think there are certain cases that require us to know the size of our dataframe in advance. One of them is when we want to apply a broadcast operation. As you might already know, broadcasting requires the dataframe to be small enough to fit in memory on each executor. This implicitly means that we should know the size of the dataframe beforehand in order for broadcasting to be applied successfully. Just FYI, Spark lets us configure the maximum size of a dataframe that can be pushed to each executor; precisely, this maximum size can be configured via spark.conf.set("spark.sql.autoBroadcastJoinThreshold", MAX_SIZE).
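For context, the threshold setting and the explicit broadcast hint look roughly like this; the 10 MB value is just an example, and big_df/small_df are assumed to exist already.

```python
# Example: set the auto-broadcast threshold and force a broadcast join explicitly.
from pyspark.sql.functions import broadcast

# Tables smaller than this many bytes (~10 MB here) may be broadcast automatically.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)

# Or hint it explicitly when we already know the dataframe is small enough.
result = big_df.join(broadcast(small_df), on="key", how="inner")
```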
In Spark, data shuffling simply means data movement. On a single machine with multiple partitions, data shuffling means that data moves from one partition to another. Meanwhile, across multiple machines, data shuffling can involve two kinds of movement: data moving from one partition (A) to another partition (B) within the same machine (M1), and data moving from partition B to another partition (C) on a different machine (M2). Data in partition C might then be moved again to a partition on yet another machine (M3).
One of the characteristics of Spark that makes me interested in exploring this framework further is its lazy evaluation approach. Simply put, Spark won't execute a transformation until an action is called. I think it's logical: if we only specify the transformation plan and don't ask Spark to execute it, why should it force itself to do the computation on the data? In addition, by implementing this lazy evaluation approach, Spark is able to optimize the logical plan, so the manual work of making a query more efficient might be reduced significantly. Cool, right?
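A tiny demonstration of the idea, assuming an active SparkSession named spark:

```python
# Tiny demonstration of lazy evaluation (assumes an active SparkSession `spark`).
df = spark.range(1_000_000)

# These are transformations: nothing is computed yet, Spark only builds the plan.
filtered = df.filter(df.id % 2 == 0)
doubled = filtered.selectExpr("id * 2 AS doubled")

# Only when an action is called does Spark optimize and execute the plan.
print(doubled.count())
```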
A statement I encountered a few days ago: "Avoid using Resilient Distributed Datasets (RDDs) and use Dataframes/Datasets (DFs/DTs) instead, especially in the production stage".
I was implementing a paper related to balanced random forest (BRF). Just FYI, a BRF consists of a number of decision trees where each tree receives instances with a 1:1 ratio of minority to majority class. A BRF also uses m randomly selected features to determine the best split.
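The sketch below is my own shorthand of that idea using scikit-learn building blocks, not the paper's reference implementation: each tree gets a balanced bootstrap (1:1 minority vs. majority) and considers a random subset of features at each split.

```python
# Rough sketch of a balanced random forest: per-tree 1:1 bootstrap of minority and
# majority instances, plus random feature subsets at each split (via max_features).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_balanced_forest(X, y, n_trees=100, random_state=42):
    rng = np.random.RandomState(random_state)
    minority = np.where(y == 1)[0]
    majority = np.where(y == 0)[0]
    trees = []
    for _ in range(n_trees):
        # Bootstrap the minority class, then draw the same number from the majority class.
        min_idx = rng.choice(minority, size=len(minority), replace=True)
        maj_idx = rng.choice(majority, size=len(minority), replace=True)
        idx = np.concatenate([min_idx, maj_idx])
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=rng.randint(1 << 30))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_balanced_forest(trees, X):
    # Majority vote over all trees.
    votes = np.mean([t.predict(X) for t in trees], axis=0)
    return (votes >= 0.5).astype(int)
```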
Recently I watched a YouTube video about the infinite hotel paradox, which was introduced in the 1920s by the German mathematician David Hilbert. In case you're curious about the video, just search YouTube for "The Infinite Hotel Paradox".
The initial question that popped up in my mind was how to make LIME perform faster. This should be useful when there is a lot of data to explain.
Spark UDFs (user-defined functions) are simply functions created to overcome performance problems when you want to process a dataframe. They're useful when your plain Python functions are too slow at processing a dataframe at large scale. A plain Python function processes the dataframe one row at a time, meaning the work is executed sequentially. Meanwhile, if you use a Spark UDF, Spark distributes the dataframe and the UDF to the available executors, so the dataframe processing is executed in parallel. For more information about Spark UDFs, please take a look at this post.
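A minimal example of defining and applying a Spark UDF; the dataframe and column names are made up, and an active SparkSession named spark is assumed.

```python
# Minimal Spark UDF example (made-up data; assumes an active SparkSession `spark`).
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

@F.udf(returnType=StringType())
def shout(name):
    return name.upper() + "!"

df.withColumn("shouted", shout(F.col("name"))).show()
```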
If in the probability context we state that P(x1, x2, ..., xn | params) means the probability of getting a set of observations x1, x2, …, and xn given the distribution parameters, then in the likelihood context we get the following.
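I'd read "the following" as the standard flip of viewpoint: the same quantity, but treated as a function of the parameters with the observations held fixed.

```latex
% Likelihood: the same expression, viewed as a function of the parameters.
L(\theta \mid x_1, x_2, \ldots, x_n) = P(x_1, x_2, \ldots, x_n \mid \theta)
```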
There are several ways of removing duplicate rows in Spark. Two of them are distinct() and dropDuplicates(). The former lets us remove rows with the same values in all columns, while the latter lets us remove rows with the same values in a selected subset of columns.
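A quick made-up example of the difference, assuming an active SparkSession:

```python
# Made-up example showing distinct() vs dropDuplicates().
df = spark.createDataFrame(
    [("a", 1), ("a", 1), ("a", 2), ("b", 1)],
    ["key", "value"],
)

# Drops rows identical across ALL columns -> keeps ("a", 1), ("a", 2), ("b", 1).
df.distinct().show()

# Drops rows with the same values in the selected column(s) -> keeps one row per key.
df.dropDuplicates(["key"]).show()
```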
To me, prime numbers are really interesting in their role as the building blocks of other numbers. According to the Fundamental Theorem of Arithmetic, every integer N greater than 1 can be written as a product of primes P1, P2, P3, ..., Pk, and this factorisation is unique up to the order of the factors.
Basically, Metabase's SparkSQL integration only allows users to access data in the Hive warehouse. In other words, the data must be in Hive table format to be loadable.
Application monitoring is critically important, especially when we encounter performance issues. In Spark, one way to monitor a Spark application is via Spark UI. The problem is, this Spark UI can only be accessed when the application is running.
It's quite bothersome to read a publication that only provides a "statistically significant" result without saying much about the analysis performed prior to conducting the experiment.
Recently I played with a simple Spark Streaming application. Precisely, I investigated the behavior of repartitioning at different levels of the input data streams. For instance, suppose we have two data streams, linesDStream and wordsDStream. The question is: is the repartitioning result different if I repartition after linesDStream versus after wordsDStream?
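Roughly, the two placements being compared look like the sketch below; the socket source, port, batch interval, and partition count are arbitrary choices for illustration, and an existing SparkContext sc is assumed.

```python
# Sketch of the two repartition placements being compared (arbitrary source and settings).
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 5)  # assumes an existing SparkContext `sc`; 5-second batches

lines_dstream = ssc.socketTextStream("localhost", 9999)

# Placement 1: repartition the lines stream, then derive the words from it.
words_from_repartitioned_lines = (
    lines_dstream.repartition(4)
                 .flatMap(lambda line: line.split(" "))
)

# Placement 2: derive the words first, then repartition the words stream.
words_dstream = (
    lines_dstream.flatMap(lambda line: line.split(" "))
                 .repartition(4)
)
```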
In the previous article, Kafka Consumer Awareness of New Topic Partitions, I wrote about partition balancing by Kafka consumers. In other words, I wanted to see whether Kafka consumers are aware of new topic partitions.
Basically, you can think of Kafka as a messaging system. When an application sends a message to another application, one thing they need to agree on is how to send the message. The most obvious use case for a messaging system, in my opinion, is when we're dealing with big data. For instance, a sender application produces a large amount of data that needs to be processed by a receiver application, but the receiver's processing rate is lower than the sending rate. Consequently, the receiver might be overloaded, since it can't accept new messages while it's still busy processing. Even if we use distributed receivers, we still have to tell the sender which receiver node it should send each message to.
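For a feel of the moving parts, here is a minimal producer/consumer pair with the kafka-python client; the broker address, topic name, and group ID are placeholders.

```python
# Minimal producer/consumer sketch with the kafka-python client (placeholder names).
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("orders", b"order-1 created")
producer.flush()

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="order-processors",      # consumers in the same group share the partitions
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.partition, message.offset, message.value)
```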
Suppose we conduct K experiments on a certain kind of measurement. In each experiment, we take N observations. In other words, we'll have N * K data points at the end.
I was experimenting with weight of evidence (WoE) encoding for continuous data. The preparation is quite different from that for categorical data in terms of binning characteristics.
In simple terms, we could define collinearity as a condition where two variables are highly correlated (positively or negatively). When more than two variables are involved, it's usually referred to as multicollinearity.
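One common way to quantify this is the variance inflation factor (VIF); the sketch below uses random toy data with a deliberately collinear pair, just for illustration.

```python
# Sketch: detect multicollinearity with variance inflation factors (random toy data).
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.RandomState(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)  # deliberately collinear with x1
x3 = rng.normal(size=200)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# A VIF far above 10 is a common rule of thumb for problematic collinearity.
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))
```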
There might be a case where we need to perform a certain operation on each data partition. One of the most common examples is the use of mapPartitions. Sometimes such an operation requires a more complicated procedure, which in the end means the function executing the operation needs more than one parameter.
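A simple way to handle the extra parameters is a closure or functools.partial; the values and operation below are made up, and an existing SparkContext sc is assumed.

```python
# Made-up sketch: pass extra parameters to a per-partition function via functools.partial.
from functools import partial

def process_partition(iterator, multiplier, offset):
    for value in iterator:
        yield value * multiplier + offset

rdd = sc.parallelize(range(10), numSlices=4)  # assumes an existing SparkContext `sc`

# mapPartitions only hands the function an iterator, so bind the other arguments first.
result = rdd.mapPartitions(partial(process_partition, multiplier=10, offset=1))
print(result.collect())
```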
Yesterday I came across an interesting maths paper discussing the Riemann hypothesis. There's a lot of maths in it, but I think I enjoyed the read. Frankly speaking, although mathematics is one of my favourite subjects, I've rarely played with it (especially pure maths) since I got acquainted with the AI and big data engineering world. Now I think it's just fine to play with it again. Just for fun.
Whenever we call dataframe.writeStream.start() in Structured Streaming, Spark creates a new stream that reads from a data source (specified by dataframe.readStream). The data passing through the stream is then processed (if needed) and written to a certain sink.
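A small end-to-end sketch of such a query; the socket source, console sink, and checkpoint path are placeholders, and an active SparkSession named spark is assumed.

```python
# Sketch of a Structured Streaming query: read from a source, transform, write to a sink.
from pyspark.sql import functions as F

lines = (
    spark.readStream
         .format("socket")            # placeholder source
         .option("host", "localhost")
         .option("port", 9999)
         .load()
)

word_counts = (
    lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
         .groupBy("word")
         .count()
)

query = (
    word_counts.writeStream
               .outputMode("complete")
               .format("console")     # placeholder sink
               .option("checkpointLocation", "/tmp/checkpoints/word_counts")
               .start()
)
query.awaitTermination()
```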
Basically, code obfuscation is a technique used to modify source code so that it becomes difficult to understand yet remains fully functional. The main objective is to protect intellectual property and prevent attackers from reverse engineering proprietary source code.
I was curious about what the checkpoint files in Spark Structured Streaming look like. To introduce the basic concept, a checkpoint simply records the progress information of a streaming process. These checkpoint files are usually used for failure recovery. A more detailed explanation can be found here.
The problem is really simple. After equi-joining (inner) two dataframes, a certain operation is applied to each partition. Precisely, such an operation can be accomplished with code along the following lines:
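The sketch below is my own reconstruction of that setup with made-up names, not the original code; the per-partition operation is just a placeholder.

```python
# My own sketch (made-up names): equi-join two dataframes, then apply a
# per-partition operation. Assumes existing dataframes df_a and df_b with a "key" column.
def process_partition(rows):
    # Placeholder per-partition operation: count the rows in the partition.
    yield sum(1 for _ in rows)

joined = df_a.join(df_b, on="key", how="inner")
per_partition_counts = joined.rdd.mapPartitions(process_partition).collect()
print(per_partition_counts)
```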
A few days back I tried to set up a Spark standalone cluster on my own machine with the following specification: two workers (with balanced cores) within a single node.
The correlation between lags (lag > 0) is close to zero (each autocorrelation lies within the bounds, indicating no statistically significant difference from zero).