A few days ago I conducted a little experiment on Spark's RDD operations. One of them was the foreach operation (which is an action). Simply put, this operation applies a user-specified function to each row in the RDD. Here's a simple example:
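The snippet below is a minimal sketch of that kind of experiment, not the original code; it assumes a running SparkContext named sc and uses an accumulator only to make the side effect observable on the driver.

```python
# Minimal foreach sketch (assumes an existing SparkContext `sc`).
rdd = sc.parallelize([1, 2, 3, 4, 5])

# foreach is an action: it applies the function to every element on the executors
# and returns nothing to the driver, so an accumulator is used to observe the effect.
total = sc.accumulator(0)

def visit(row):
    total.add(row)  # runs on the executors

rdd.foreach(visit)
print(total.value)  # 15
```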
According to the code base, the driver status tracking feature is only implemented for the standalone cluster manager. However, based on this reference, we could also poll the driver status for Mesos and Kubernetes (in cluster deploy mode). Additionally, such a feature is also possible for YARN.
A few days ago I did a small experiment with Airflow. To be precise, I scheduled Airflow to run a Spark job via spark-submit to a standalone cluster. I actually mentioned briefly how to create a DAG and Operators in the previous post.
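As a rough illustration (not the exact DAG from the experiment), a minimal DAG that triggers spark-submit against a standalone master could look like the sketch below; the master URL, file path, schedule, and task names are placeholders, and the BashOperator import path varies across Airflow versions.

```python
# Hypothetical sketch: schedule a spark-submit to a standalone cluster with Airflow.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator  # airflow.operators.bash on newer versions

with DAG(
    dag_id="spark_submit_example",
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    submit_spark_job = BashOperator(
        task_id="submit_spark_job",
        bash_command=(
            "spark-submit "
            "--master spark://spark-master:7077 "  # placeholder master URL
            "--deploy-mode client "
            "/path/to/job.py"                      # placeholder job file
        ),
    )
```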
Airflow is basically a workflow management system. When we're talking about a "workflow", we're referring to a sequence of tasks that need to be performed to accomplish a certain goal. A simple example would be an ordinary ETL job: fetching data from data sources, transforming the data into the required formats, and then storing the transformed data in a data warehouse.
Weight of evidence (WoE) and information value (IV) are used as a framework for attribute relevance analysis. WoE and IV can be utilised independently since each of them plays a different role.
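For context, here is a rough pandas sketch of how WoE and IV are commonly computed for one binned attribute against a binary target; the data, column names, and the ln(% non-events / % events) convention are my own choices for illustration.

```python
import numpy as np
import pandas as pd

# Made-up example: one binned attribute and a binary target (1 = event).
df = pd.DataFrame({
    "bucket": ["A", "A", "B", "B", "B", "C", "C", "C", "C", "C"],
    "target": [0, 1, 0, 0, 1, 0, 0, 0, 1, 1],
})

grouped = df.groupby("bucket")["target"].agg(["sum", "count"])
grouped.columns = ["events", "total"]
grouped["non_events"] = grouped["total"] - grouped["events"]

# Share of events / non-events captured by each bucket.
grouped["pct_events"] = grouped["events"] / grouped["events"].sum()
grouped["pct_non_events"] = grouped["non_events"] / grouped["non_events"].sum()

# WoE per bucket, and IV as the sum of the per-bucket contributions.
grouped["woe"] = np.log(grouped["pct_non_events"] / grouped["pct_events"])
grouped["iv"] = (grouped["pct_non_events"] - grouped["pct_events"]) * grouped["woe"]

print(grouped[["woe", "iv"]])
print("IV:", grouped["iv"].sum())
```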
Kerberos is simply a "ticket-based" authentication protocol. It enhances the security approach used by password-based authentication protocols. Since there's a possibility that eavesdroppers could capture the password, Kerberos mitigates this by leveraging a ticket (how it is generated is explained below) that ideally should only be known by the client and the service.
Have you ever heard of the imblearn package? Based on its name, I think people who are familiar with machine learning will presume that it's a package specifically created for tackling the problem of imbalanced data. If you dig a little deeper, you'll find its GitHub repository here. And yes, once again, it's a Python package for playing with imbalanced data.
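For instance, random under-sampling with imblearn looks roughly like this; the toy dataset is made up, and older imblearn versions expose fit_sample instead of fit_resample.

```python
# Small, hypothetical example of rebalancing a dataset with imblearn.
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print("before:", Counter(y))

sampler = RandomUnderSampler(random_state=42)
X_res, y_res = sampler.fit_resample(X, y)  # fit_sample on older imblearn versions
print("after:", Counter(y_res))
```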
I came across a research paper related to balanced random forest (BRF) for imbalanced data. For the sake of clarity, the following is the BRF algorithm taken from the paper:
A few days back I tried to submit a Spark job to a Livy server deployed in local mode. The procedure was straightforward since the only thing to do was to specify the job file along with the configuration parameters (like what we do when using spark-submit directly).
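Roughly, submitting such a batch job through Livy's REST API looks like the sketch below; the host, file path, and configuration values are placeholders.

```python
# Hypothetical sketch: submit a Spark job as a Livy batch via the REST API.
import json

import requests

LIVY_URL = "http://localhost:8998"  # placeholder Livy endpoint

payload = {
    "file": "/path/to/job.py",          # job file visible to the Livy server
    "conf": {
        "spark.executor.memory": "1g",  # placeholder configuration parameters
        "spark.executor.cores": "1",
    },
}

resp = requests.post(
    LIVY_URL + "/batches",
    data=json.dumps(payload),
    headers={"Content-Type": "application/json"},
)
batch = resp.json()
print(batch["id"], batch["state"])
# The batch can then be polled via GET /batches/<id>/state until it finishes.
```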
One of the techniques in hyperparameter tuning is called Bayesian Optimization. It selects the next hyperparameter to evaluate based on the previous trials.
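As a quick illustration of the idea, here's a toy run with scikit-optimize's gp_minimize; the objective function and search space are made-up stand-ins for a real cross-validation loss and hyperparameter range.

```python
# Hypothetical illustration: Bayesian optimization of a toy objective with scikit-optimize.
from skopt import gp_minimize

def objective(params):
    x = params[0]
    return (x - 2.0) ** 2  # stand-in for a cross-validation loss

result = gp_minimize(
    objective,
    dimensions=[(-10.0, 10.0)],  # search space of the single "hyperparameter"
    n_calls=20,                  # each new point is chosen based on the previous trials
    random_state=42,
)
print(result.x, result.fun)
```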
In the previous post, I wrote about how to perform a pandas groupBy operation on a large dataset in a streaming way. The main problem being addressed is memory consumption, since the data size might be extremely large.
I came across an article about how to perform a groupBy operation on a large dataset. Long story short, the author proposes an approach called streaming groupBy, where the dataset is divided into chunks and the groupBy operation is applied to each chunk. This approach is implemented with pandas.
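The gist of the approach, as I understand it, is something like the following sketch; the file path, column names, and aggregation are illustrative.

```python
# Rough sketch of "streaming" groupBy with pandas: aggregate chunk by chunk,
# then combine the partial results. File path and column names are made up.
import pandas as pd

partials = []
for chunk in pd.read_csv("big_file.csv", chunksize=100_000):
    partials.append(chunk.groupby("key")["value"].sum())

# Summing the per-chunk partial sums gives the exact overall result.
result = pd.concat(partials).groupby(level=0).sum()
print(result.head())
```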
H2O provides a platform for building machine learning models in a scalable way. By focusing on scalability, it leverages the concept of cluster computing and therefore enables engineers to perform big data analytics.
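A minimal sketch of what using H2O from Python looks like; the file path, target column, and model choice are placeholders rather than a recommendation.

```python
# Hypothetical sketch: connect to an H2O cluster and train a model on an H2OFrame.
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator

h2o.init()  # starts a local H2O instance or connects to an existing cluster

# An H2OFrame lives on the H2O cluster, not inside the Python process.
frame = h2o.import_file("/path/to/data.csv")  # placeholder path

predictors = [c for c in frame.columns if c != "label"]  # "label" is a placeholder target
model = H2OGradientBoostingEstimator(ntrees=50)
model.train(x=predictors, y="label", training_frame=frame)
print(model)
```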
In BigQuery, an external data source is a data source that we can query directly even though the data is not stored in BigQuery's storage. We can query the data source just by creating an external table that refers to it, instead of loading the data into BigQuery.
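As a rough, unverified sketch of what that looks like with the google-cloud-bigquery Python client (project, dataset, table, and bucket names are placeholders):

```python
# Rough sketch: define an external table over CSV files in Cloud Storage.
# Project, dataset, table, and URI names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/data/*.csv"]
external_config.autodetect = True  # let BigQuery infer the schema

table = bigquery.Table("my-project.my_dataset.my_external_table")
table.external_data_configuration = external_config
client.create_table(table)

# The table can now be queried directly while the data stays in Cloud Storage.
```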
In the earlier section we learnt a bit about the buffer overflow technique. The primary concept is flooding the stack frame with input exceeding the buffer limit so that we can manipulate any data saved on the stack frame. Some things that can be done using this technique are changing the return address so that the attacker can call any function they want, changing the contents of variables so that the function executes the corresponding code, or changing the return value of a function.
I came across an odd use case when applying F.col() to certain dataframe operations in PySpark v2.4.0. Please note that the context of this issue is Oct 6, 2019. Such an issue might have been solved since then.
In the previous post, we discussed the implementation of the Kalman filter for a static state (the true values of the object's states are constant over time). In addition, the Kalman filter algorithm was applied to estimate a single true value.
The Kalman filter is an iterative mathematical process applied to consecutive data inputs to quickly estimate the true value (position, velocity, weight, temperature, etc.) of the object being measured, when the measured values contain random error or uncertainty.
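To make the idea concrete, here's a tiny, hand-rolled sketch of the simplest case: estimating a single constant value from noisy measurements. The measurement values and uncertainties are made up.

```python
# Minimal 1-D Kalman filter sketch for a static true value (made-up measurements).
measurements = [72.0, 75.0, 71.0, 74.0, 73.0]  # noisy readings of a constant quantity

estimate = 68.0        # initial guess
error_estimate = 2.0   # uncertainty of the current estimate
error_measure = 4.0    # assumed measurement uncertainty

for z in measurements:
    # Kalman gain: how much to trust the new measurement vs. the current estimate.
    gain = error_estimate / (error_estimate + error_measure)
    # Update the estimate and shrink its uncertainty.
    estimate = estimate + gain * (z - estimate)
    error_estimate = (1.0 - gain) * error_estimate
    print(f"measurement={z:.1f} gain={gain:.3f} estimate={estimate:.3f}")
```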
In the previous post I mentioned a simple way of estimating the density ratio of two probability distributions. I decided to create a Python package that provides such functionality.
In the previous post I shared how to detect covariate shift with a simple, model-based approach. After learning that the data distribution has changed, what can we do to address such an issue?
Covariate shift happens when the distribution of the train data differs from the distribution of the test data. Take a look at the following probability equation.
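Presumably the equation being referred to is the usual way of stating the assumption: the conditional distribution of the target stays the same while the input distribution changes.

```latex
% Covariate shift: the input distribution changes, the conditional does not.
P_{\text{train}}(y \mid x) = P_{\text{test}}(y \mid x),
\qquad
P_{\text{train}}(x) \neq P_{\text{test}}(x)
```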
I encountered an issue when applying the crosstab function in PySpark to pretty big data. And I think this should be considered a pretty big issue. Please note that the context of this issue is Sep 20, 2019. Such an issue might have been solved since then.
Code profiling is simply used to assess code performance, including that of its functions and the sub-functions within them. One of its obvious uses is code optimisation, where a developer wants to improve the code's efficiency by searching for bottlenecks.
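In Python, the quickest way to get such a breakdown is the built-in cProfile module; a tiny illustration with a made-up function:

```python
# Minimal illustration of code profiling with Python's built-in cProfile.
import cProfile
import pstats

def slow_function():
    total = 0
    for _ in range(100_000):
        total += sum(range(100))
    return total

profiler = cProfile.Profile()
profiler.enable()
slow_function()
profiler.disable()

# Show the 10 most expensive entries, sorted by cumulative time.
stats = pstats.Stats(profiler).sort_stats("cumulative")
stats.print_stats(10)
```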
A few days back I was exploring a big data quality tool called Griffin. There are lots of DQ tools out there, such as Deequ, Target's data validator, TensorFlow Data Validation, PySpark Owl, and Great Expectations. There's another one called Cerberus; however, it doesn't natively support large-scale data.
In the previous post, I mentioned that the general formula of the H statistic is the following (source: Wikipedia, Kruskal–Wallis one-way analysis of variance):
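For reference, the general form given on that Wikipedia page is, as far as I recall, the rank-based ratio below, where g is the number of groups, n_i the size of group i, r_ij the overall rank of observation j in group i, N the total number of observations, and the bars denote the group mean rank and the overall mean rank (N + 1)/2.

```latex
H = (N - 1)\,
    \frac{\sum_{i=1}^{g} n_i \left(\bar{r}_{i\cdot} - \bar{r}\right)^2}
         {\sum_{i=1}^{g} \sum_{j=1}^{n_i} \left(r_{ij} - \bar{r}\right)^2}
```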
The Kruskal-Wallis test is a non-parametric statistical test that is used to evaluate whether the medians of two or more groups are different. Since the test is non-parametric, it doesn’t assume that the data comes from a particular distribution.
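In practice the test is a one-liner with SciPy; the three sample groups below are made up.

```python
# Kruskal-Wallis test on three made-up groups using SciPy.
from scipy.stats import kruskal

group_a = [12.1, 14.3, 11.8, 13.0, 12.9]
group_b = [15.2, 16.1, 14.8, 15.9, 16.4]
group_c = [12.5, 13.1, 12.2, 13.8, 12.7]

statistic, p_value = kruskal(group_a, group_b, group_c)
print(f"H = {statistic:.3f}, p = {p_value:.4f}")
```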
A few days ago I did a little exploration on Spark’s groupBy behavior. Precisely, I wanted to see whether the order of the data was still preserved when applying groupBy on a repartitioned dataframe.
Recently I was exploring ways of adding a unique row ID column to a dataframe. The requirement is simple: "the row ID should strictly increase with a difference of one, and the data order must not be modified".
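One way that satisfies this requirement (monotonically_increasing_id alone doesn't, since its IDs aren't consecutive) is to go through zipWithIndex; a rough sketch, assuming an existing dataframe df and an active SparkSession:

```python
# Sketch: add a consecutive row ID without changing the data order,
# by round-tripping through the underlying RDD. Assumes an existing DataFrame `df`.
def with_row_id(df, id_col="row_id"):
    new_columns = [id_col] + df.columns
    return (
        df.rdd
          .zipWithIndex()                                 # (Row, index) pairs, 0-based
          .map(lambda pair: (pair[1],) + tuple(pair[0]))  # put the index first
          .toDF(new_columns)
    )

df_with_id = with_row_id(df)
df_with_id.show()
```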
A few days ago I came across a case where I needed to define a dataframe's column name with a special character, namely a dot ('.'). Take a look at the following schema example.
If you read my previous article titled Union Operation After Left-anti Join Might Result in Inconsistent Attributes Data, it was shown that the attributes data could be inconsistent when combining two dataframes after an inner join. According to the article, the solution is really simple: we just need to reorder the attributes using the select command. Here's a simple example.
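The illustration below uses made-up dataframes and column names; the point is simply to make the column order explicit before the union, since union() matches columns by position. An active SparkSession named spark is assumed.

```python
# Illustration: align the column order of both dataframes before union.
df_a = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "label"])
df_b = spark.createDataFrame([("z", 3)], ["label", "id"])  # same columns, different order

# Reorder df_b's columns to follow df_a's column order, then union.
aligned_b = df_b.select(df_a.columns)
result = df_a.union(aligned_b)
result.show()
```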
Unioning two dataframes after joining them with left_anti? Well, it seems like a straightforward approach. However, I recently encountered a case where the join operation might shift the position of the join key in the resulting dataframe. This, unfortunately, makes the result of merging the dataframes inconsistent in terms of the data in each attribute.
I encountered an intriguing result when joining a dataframe with itself (self-join). As you might already know, one of the problems that occurs when doing a self-join relates to duplicated column names. Because of this duplication, there's ambiguity when we perform operations that require us to provide the column names.
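A common way around the ambiguity is to alias each side of the self-join and qualify the column references; the employee/manager data below is made up, and an active SparkSession is assumed.

```python
# Made-up example: self-join with aliases to avoid ambiguous column references.
from pyspark.sql import functions as F

employees = spark.createDataFrame(
    [(1, "Ana", 2), (2, "Ben", 3), (3, "Cal", 2)],
    ["id", "name", "manager_id"],
)

emp = employees.alias("emp")
mgr = employees.alias("mgr")

report_lines = (
    emp.join(mgr, F.col("emp.manager_id") == F.col("mgr.id"), "inner")
       .select(
           F.col("emp.name").alias("employee"),
           F.col("mgr.name").alias("manager"),
       )
)
report_lines.show()
```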
Have you ever wondered how to find out the size of a dataframe? Perhaps it doesn't sound like a fancy thing to know, yet I think there are certain cases that require us to know the size of our dataframe in advance. One of them is when we want to apply a broadcast operation. As you might already know, broadcasting requires the dataframe to be small enough to fit in memory on each executor. This implicitly means that we should know the size of the dataframe beforehand in order for broadcasting to be applied successfully. Just FYI, Spark lets us configure the maximum size of a dataframe that can be pushed to each executor; precisely, this maximum size can be configured via spark.conf.set("spark.sql.autoBroadcastJoinThreshold", MAX_SIZE).
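For context, the threshold setting and the explicit broadcast hint look roughly like this; the 10 MB value is just an example, and big_df/small_df are assumed to exist already.

```python
# Example: set the auto-broadcast threshold and force a broadcast join explicitly.
from pyspark.sql.functions import broadcast

# Tables smaller than this many bytes (~10 MB here) may be broadcast automatically.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10 * 1024 * 1024)

# Or hint it explicitly when we already know the dataframe is small enough.
result = big_df.join(broadcast(small_df), on="key", how="inner")
```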
In Spark, data shuffling simply means data movement. On a single machine with multiple partitions, data shuffling means that data moves from one partition to another. Meanwhile, across multiple machines, data shuffling can involve two kinds of movement: data moving from one partition (A) to another partition (B) within the same machine (M1), and data moving from partition B to another partition (C) on a different machine (M2). Data in partition C might then be moved again to a partition on yet another machine (M3).
One of the characteristics of Spark that makes me interested in exploring this framework further is its lazy evaluation approach. Simply put, Spark won't execute a transformation until an action is called. I think it's logical: if we only specify the transformation plan and don't ask Spark to execute it, why should it force itself to do the computation on the data? In addition, by implementing this lazy evaluation approach, Spark is able to optimize the logical plan, so the manual work of making a query more efficient might be reduced significantly. Cool, right?
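A tiny demonstration of the idea, assuming an active SparkSession named spark:

```python
# Tiny demonstration of lazy evaluation (assumes an active SparkSession `spark`).
df = spark.range(1_000_000)

# These are transformations: nothing is computed yet, Spark only builds the plan.
filtered = df.filter(df.id % 2 == 0)
doubled = filtered.selectExpr("id * 2 AS doubled")

# Only when an action is called does Spark optimize and execute the plan.
print(doubled.count())
```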
A statement I encountered a few days ago: "Avoid using Resilient Distributed Datasets (RDDs) and use Dataframes/Datasets (DFs/DTs) instead, especially in the production stage".
I was implementing a paper related to balanced random forest (BRF). Just FYI, a BRF consists of a number of decision trees where each tree receives instances with a 1:1 ratio of minority to majority class. A BRF also uses m randomly selected features to determine the best split.
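The sketch below is my own shorthand of that idea using scikit-learn building blocks, not the paper's reference implementation: each tree gets a balanced bootstrap (1:1 minority vs. majority) and considers a random subset of features at each split.

```python
# Rough sketch of a balanced random forest: per-tree 1:1 bootstrap of minority and
# majority instances, plus random feature subsets at each split (via max_features).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_balanced_forest(X, y, n_trees=100, random_state=42):
    rng = np.random.RandomState(random_state)
    minority = np.where(y == 1)[0]
    majority = np.where(y == 0)[0]
    trees = []
    for _ in range(n_trees):
        # Bootstrap the minority class, then draw the same number from the majority class.
        min_idx = rng.choice(minority, size=len(minority), replace=True)
        maj_idx = rng.choice(majority, size=len(minority), replace=True)
        idx = np.concatenate([min_idx, maj_idx])
        tree = DecisionTreeClassifier(max_features="sqrt",
                                      random_state=rng.randint(1 << 30))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_balanced_forest(trees, X):
    # Majority vote over all trees.
    votes = np.mean([t.predict(X) for t in trees], axis=0)
    return (votes >= 0.5).astype(int)
```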
Recently I watched a YouTube video about the infinite hotel paradox, which was introduced in the 1920s by the German mathematician David Hilbert. In case you're curious about the video, just search YouTube for "The Infinite Hotel Paradox".
The initial question that popped up in my mind was how to make LIME perform faster. This should be useful when there is a lot of data to explain.
Spark UDFs (user-defined functions) are simply functions created to overcome performance problems when you want to process a dataframe. They're useful when your plain Python functions are too slow at processing a dataframe at large scale. A plain Python function processes the dataframe one row at a time, meaning the work is executed sequentially. Meanwhile, if you use a Spark UDF, Spark distributes the dataframe and the UDF to the available executors, so the dataframe processing is executed in parallel. For more information about Spark UDFs, please take a look at this post.
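A minimal example of defining and applying a Spark UDF; the dataframe and column names are made up, and an active SparkSession named spark is assumed.

```python
# Minimal Spark UDF example (made-up data; assumes an active SparkSession `spark`).
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

df = spark.createDataFrame([("alice",), ("bob",)], ["name"])

@F.udf(returnType=StringType())
def shout(name):
    return name.upper() + "!"

df.withColumn("shouted", shout(F.col("name"))).show()
```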
If in the probability context we state that P(x1, x2, ..., xn | params) means the probability of getting a set of observations x1, x2, …, and xn given the distribution parameters, then in the likelihood context we get the following.
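I'd read "the following" as the standard flip of viewpoint: the same quantity, but treated as a function of the parameters with the observations held fixed.

```latex
% Likelihood: the same expression, viewed as a function of the parameters.
L(\theta \mid x_1, x_2, \ldots, x_n) = P(x_1, x_2, \ldots, x_n \mid \theta)
```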
There are several ways of removing duplicate rows in Spark. Two of them are distinct() and dropDuplicates(). The former lets us remove rows with the same values in all columns, while the latter lets us remove rows with the same values in a selected subset of columns.
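A quick made-up example of the difference, assuming an active SparkSession:

```python
# Made-up example showing distinct() vs dropDuplicates().
df = spark.createDataFrame(
    [("a", 1), ("a", 1), ("a", 2), ("b", 1)],
    ["key", "value"],
)

# Drops rows identical across ALL columns -> keeps ("a", 1), ("a", 2), ("b", 1).
df.distinct().show()

# Drops rows with the same values in the selected column(s) -> keeps one row per key.
df.dropDuplicates(["key"]).show()
```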
To me, prime numbers are really interesting in their role as the building blocks of other numbers. According to the Fundamental Theorem of Arithmetic, every integer N greater than 1 can be written as a product of primes P1, P2, P3, ..., Pk, and this factorisation is unique up to the order of the factors.
Basically, Metabase's SparkSQL integration only allows users to access data in the Hive warehouse. In other words, the data must be in Hive table format to be loadable.
Application monitoring is critically important, especially when we encounter performance issues. In Spark, one way to monitor a Spark application is via Spark UI. The problem is, this Spark UI can only be accessed when the application is running.
It's quite bothersome to read a publication that only provides a "statistically significant" result without saying much about the analysis performed prior to conducting the experiment.
Recently I played with a simple Spark Streaming application. Precisely, I investigated the behavior of repartitioning at different levels of the input data streams. For instance, suppose we have two data streams, linesDStream and wordsDStream. The question is: is the repartitioning result different if I repartition after linesDStream versus after wordsDStream?
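Roughly, the two placements being compared look like the sketch below; the socket source, port, batch interval, and partition count are arbitrary choices for illustration, and an existing SparkContext sc is assumed.

```python
# Sketch of the two repartition placements being compared (arbitrary source and settings).
from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 5)  # assumes an existing SparkContext `sc`; 5-second batches

lines_dstream = ssc.socketTextStream("localhost", 9999)

# Placement 1: repartition the lines stream, then derive the words from it.
words_from_repartitioned_lines = (
    lines_dstream.repartition(4)
                 .flatMap(lambda line: line.split(" "))
)

# Placement 2: derive the words first, then repartition the words stream.
words_dstream = (
    lines_dstream.flatMap(lambda line: line.split(" "))
                 .repartition(4)
)
```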
In the previous article, Kafka Consumer Awareness of New Topic Partitions, I wrote about partition balancing by Kafka consumers. In other words, I wanted to see whether Kafka consumers are aware of new topic partitions.
Basically, you can think of Kafka as a messaging system. When an application sends a message to another application, one thing they need to agree on is how to send the message. The most obvious use case for a messaging system, in my opinion, is when we're dealing with big data. For instance, a sender application produces a large amount of data that needs to be processed by a receiver application, but the receiver's processing rate is lower than the sending rate. Consequently, the receiver might be overloaded, since it can't accept new messages while it's still busy processing. Even if we use distributed receivers, we still have to tell the sender which receiver node it should send each message to.
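For a feel of the moving parts, here is a minimal producer/consumer pair with the kafka-python client; the broker address, topic name, and group ID are placeholders.

```python
# Minimal producer/consumer sketch with the kafka-python client (placeholder names).
from kafka import KafkaConsumer, KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("orders", b"order-1 created")
producer.flush()

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    group_id="order-processors",      # consumers in the same group share the partitions
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.partition, message.offset, message.value)
```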
Suppose we conduct K experiments on a certain kind of measurement. In each experiment, we take N observations. In other words, we'll have N * K data points at the end.
I was experimenting with weight of evidence (WoE) encoding for continuous data. The preparation is quite different from that for categorical data in terms of binning characteristics.
In simple terms, we could define collinearity as a condition where two variables are highly correlated (positively or negatively). When more than two variables are involved, it's usually referred to as multicollinearity.
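One common way to quantify this is the variance inflation factor (VIF); the sketch below uses random toy data with a deliberately collinear pair, just for illustration.

```python
# Sketch: detect multicollinearity with variance inflation factors (random toy data).
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.RandomState(0)
x1 = rng.normal(size=200)
x2 = 2 * x1 + rng.normal(scale=0.1, size=200)  # deliberately collinear with x1
x3 = rng.normal(size=200)
X = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})

# A VIF far above 10 is a common rule of thumb for problematic collinearity.
for i, col in enumerate(X.columns):
    print(col, variance_inflation_factor(X.values, i))
```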
There might be a case where we need to perform a certain operation on each data partition. One of the most common examples is the use of mapPartitions. Sometimes such an operation requires a more complicated procedure, which in the end means the function executing the operation needs more than one parameter.
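A simple way to handle the extra parameters is a closure or functools.partial; the values and operation below are made up, and an existing SparkContext sc is assumed.

```python
# Made-up sketch: pass extra parameters to a per-partition function via functools.partial.
from functools import partial

def process_partition(iterator, multiplier, offset):
    for value in iterator:
        yield value * multiplier + offset

rdd = sc.parallelize(range(10), numSlices=4)  # assumes an existing SparkContext `sc`

# mapPartitions only hands the function an iterator, so bind the other arguments first.
result = rdd.mapPartitions(partial(process_partition, multiplier=10, offset=1))
print(result.collect())
```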
Yesterday I came across an interesting maths paper discussing the Riemann hypothesis. There's a lot of maths in it, but I think I enjoyed the read. Frankly speaking, although mathematics is one of my favourite subjects, I've rarely played with it (especially pure maths) since I got acquainted with the AI and big data engineering world. Now I think it's just fine to play with it again. Just for fun.
Whenever we call dataframe.writeStream.start() in Structured Streaming, Spark creates a new stream that reads from a data source (specified by dataframe.readStream). The data passing through the stream is then processed (if needed) and written to a certain sink.
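A small end-to-end sketch of such a query; the socket source, console sink, and checkpoint path are placeholders, and an active SparkSession named spark is assumed.

```python
# Sketch of a Structured Streaming query: read from a source, transform, write to a sink.
from pyspark.sql import functions as F

lines = (
    spark.readStream
         .format("socket")            # placeholder source
         .option("host", "localhost")
         .option("port", 9999)
         .load()
)

word_counts = (
    lines.select(F.explode(F.split(lines.value, " ")).alias("word"))
         .groupBy("word")
         .count()
)

query = (
    word_counts.writeStream
               .outputMode("complete")
               .format("console")     # placeholder sink
               .option("checkpointLocation", "/tmp/checkpoints/word_counts")
               .start()
)
query.awaitTermination()
```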
Basically, code obfuscation is a technique used to modify source code so that it becomes difficult to understand yet remains fully functional. The main objective is to protect intellectual property and prevent attackers from reverse engineering proprietary source code.
I was curious about what the checkpoint files in Spark Structured Streaming look like. To introduce the basic concept, a checkpoint simply records the progress information of a streaming process. These checkpoint files are usually used for failure recovery. A more detailed explanation can be found here.
The problem is really simple. After equi-joining (inner) two dataframes, a certain operation is applied to each partition. Precisely, such an operation can be accomplished with code along the following lines:
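The sketch below is my own reconstruction of that setup with made-up names, not the original code; the per-partition operation is just a placeholder.

```python
# My own sketch (made-up names): equi-join two dataframes, then apply a
# per-partition operation. Assumes existing dataframes df_a and df_b with a "key" column.
def process_partition(rows):
    # Placeholder per-partition operation: count the rows in the partition.
    yield sum(1 for _ in rows)

joined = df_a.join(df_b, on="key", how="inner")
per_partition_counts = joined.rdd.mapPartitions(process_partition).collect()
print(per_partition_counts)
```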
A few days back I tried to set up a Spark standalone cluster on my own machine with the following specification: two workers (with balanced cores) within a single node.
The correlation between lags (lag > 0) is close to zero (each autocorrelation lies within the bounds, indicating no statistically significant difference from zero).