Posts by Tags

accumulator

Spark Accumulator

1 minute read

Published:

A few days ago I conducted a little experiment on Spark’s RDD operations. One of them was the foreach operation (categorized as an action). Simply put, this operation is applied to each row in the RDD, and the operation itself is specified via a function. Here’s a simple example:

accuracy

adminer

ah_choo!

aime

airflow

Setting Up & Debugging Airflow On Local Machine

5 minute read

Published:

Airflow is basically a workflow management system. When we’re talking about “workflow”, we’re referring to a sequence of tasks that need to be performed to accomplish a certain goal. A simple example is an ordinary ETL job: fetching data from data sources, transforming the data into the formats the requirements call for, and then storing the transformed data in a data warehouse.

algebra

algorithm

am-gm

IMO 2012 Problem 2 - Solution

4 minute read

Published:

Let’s play with the 2nd problem of the International Mathematical Olympiad (IMO) 2012.

apache griffin

assembly

Level 2. Hi~

14 minute read

Published:

Primary purpose:

Stack Frame

5 minute read

Published:

To discuss the stack frame, we’ll look at it from the Assembly language point of view.

What is Buffer Overflow?

1 minute read

Published:

Buffer overflow is a code exploitation technique that takes advantage of a buffer weakness. A buffer is a block of memory used for storing data.

attribute relevance analysis

authentication

The Three-Headed Hound of the Underworld (Kerberos)

6 minute read

Published:

Kerberos is simply a “ticket-based” authentication protocol. It enhances the security approach used by password-based authentication protocols. Since eavesdroppers might be able to capture the password, Kerberos mitigates this by leveraging a ticket (how it is generated is explained below) that ideally should only be known by the client and the service.

balanced random forest

Implementing Balanced Random Forest via imblearn

3 minute read

Published:

Have you ever heard of the imblearn package? Based on its name, I think people who are familiar with machine learning are going to presume that it’s a package specifically created for tackling the problem of imbalanced data. If you do a deeper search, you’re gonna find its GitHub repository here. And yes, once again, it’s a Python package for playing with imbalanced data.

batches

bayes

bayesian optimization

big data

Incremental Query for Large Streaming Data Operation

4 minute read

Published:

In the previous post, I wrote about how to perform a pandas groupBy operation on a large dataset in a streaming way. The main problem being addressed is optimizing memory consumption, since the data size might be extremely large.

Streaming GroupBy for Large Datasets with Pandas

7 minute read

Published:

I came across an article about how to perform a groupBy operation on a large dataset. Long story short, the author proposes an approach called streaming groupBy, where the dataset is divided into chunks and the groupBy operation is applied to each chunk. This approach is implemented with pandas.
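
The chunk-then-combine idea can be sketched without pandas. The following is a minimal stdlib-only illustration, not the article’s pandas implementation; the function name `streaming_group_sum` and the `(key, value)` row layout are assumptions made for the example.

```python
from collections import defaultdict
from itertools import islice


def streaming_group_sum(rows, chunk_size):
    """Sum values per key while holding at most `chunk_size` rows at a time.

    `rows` is any iterable of (key, value) pairs (a hypothetical layout).
    """
    totals = defaultdict(float)
    iterator = iter(rows)
    while True:
        # Pull the next chunk; an empty chunk means the stream is exhausted.
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            break
        # Aggregate within the chunk first...
        partial = defaultdict(float)
        for key, value in chunk:
            partial[key] += value
        # ...then merge the partial result into the running totals.
        for key, subtotal in partial.items():
            totals[key] += subtotal
    return dict(totals)


rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
result = streaming_group_sum(rows, chunk_size=2)
```

This works for a sum because partial sums combine by addition. Aggregations such as the mean need extra bookkeeping, for example a running sum and a running count per key.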

Apache Cassandra: Begins with Docker

2 minute read

Published:

This article is about how to install Cassandra and play with several of its query languages. To accomplish that, I’m going to utilize Docker.

bigquery

Permanent and Temporary External Table in BigQuery

6 minute read

Published:

In BigQuery, an external data source is a data source that we can query directly even though the data is not stored in BigQuery’s storage. We can query the data source just by creating an external table that refers to it instead of loading the data into BigQuery.

borderline smote01

buffer lab

Level 2. Hi~

14 minute read

Published:

Primary purpose:

Buffer Lab

5 minute read

Published:

Purpose:

buffer overflow

Examples of Buffer Overflow Attack

4 minute read

Published:

In the earlier section we learnt a bit about the buffer overflow technique. The primary concept is flooding the stack frame with input exceeding the buffer limit so that we can manipulate any data saved on the stack frame. Some things that can be done using this technique are changing the return address so that attackers can call any function they want, changing the content of variables so that the function executes the corresponding code, or changing the return value of a function.

Stack Frame

5 minute read

Published:

To discuss the stack frame, we’ll look at it from the Assembly language point of view.

What is Buffer Overflow?

1 minute read

Published:

Buffer overflow is a code exploitation technique that takes advantage of a buffer weakness. A buffer is a block of memory used for storing data.

calculus

cap theorem

CAP Theorem

3 minute read

Published:

“Consistency, Availability, and Partition Tolerance” - choose two.

cassandra

Apache Cassandra: Begins with Docker

2 minute read

Published:

This article is about how to install Cassandra and play with several of its query languages. To accomplish that, I’m going to utilize Docker.

central moments

classification

cluster

cluster managers

collaborative filtering

column

compute the margin

computer organization and architecture

Level 2. Hi~

14 minute read

Published:

Primary purpose:

Buffer Lab

5 minute read

Published:

Purpose:

Examples of Buffer Overflow Attack

4 minute read

Published:

In the earlier section we learnt a bit about the buffer overflow technique. The primary concept is flooding the stack frame with input exceeding the buffer limit so that we can manipulate any data saved on the stack frame. Some things that can be done using this technique are changing the return address so that attackers can call any function they want, changing the content of variables so that the function executes the corresponding code, or changing the return value of a function.

Stack Frame

5 minute read

Published:

To discuss the stack frame, we’ll look at it from the Assembly language point of view.

What is Buffer Overflow?

1 minute read

Published:

Buffer overflow is a code exploitation technique that takes advantage of a buffer weakness. A buffer is a block of memory used for storing data.

conditional probability

consumer

contradiction

The Legendary Question Six IMO 1988

9 minute read

Published:

The final problem of the International Mathematical Olympiad (IMO) 1988 is considered to be the most difficult problem of the contest.

control theory

Kalman Filter for Dynamic State & Multiple Measurements

2 minute read

Published:

In the previous post, we discussed the implementation of the Kalman filter for a static state (the true values of the object’s states are constant over time). In addition, the Kalman filter algorithm was applied to estimate a single true value.

Kalman Filter for Static State & Single Measurement

1 minute read

Published:

The Kalman filter is an iterative mathematical process applied to consecutive data inputs to quickly estimate the true value (position, velocity, weight, temperature, etc.) of the object being measured when the measured values contain random error or uncertainty.
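
For a static state and a single measured quantity, the iteration reduces to three updates per measurement: compute the gain, move the estimate toward the measurement, and shrink the estimate’s uncertainty. The following is a minimal sketch of that textbook scalar form; the function name, argument names, and sample numbers are illustrative assumptions, not taken from the post.

```python
def kalman_static(measurements, initial_estimate, initial_error, measurement_error):
    """Scalar Kalman filter for a constant true value.

    `initial_error` and `measurement_error` are the uncertainties of the
    starting estimate and of each measurement (assumed constant here).
    Returns the list of estimates and the final estimate uncertainty.
    """
    estimate = initial_estimate
    error = initial_error
    history = []
    for z in measurements:
        # Gain: how much to trust the new measurement versus the current estimate.
        gain = error / (error + measurement_error)
        # Move the estimate toward the measurement by that fraction.
        estimate = estimate + gain * (z - estimate)
        # The estimate's uncertainty shrinks after every update.
        error = (1.0 - gain) * error
        history.append(estimate)
    return history, error


# Hypothetical readings of a quantity whose true value does not change.
estimates, final_error = kalman_static(
    [75.0, 71.0, 70.0, 74.0],
    initial_estimate=68.0,
    initial_error=2.0,
    measurement_error=4.0,
)
```

Because the gain falls as the estimate’s uncertainty falls, later measurements move the estimate less than earlier ones.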

covariate shift

Tackling Covariate Shift in ML Using ML

2 minute read

Published:

In the previous post I mentioned a simple way of estimating the density ratio of two probability distributions. I decided to create a Python package that provides this functionality.

crosstab

Crosstab Does Not Yield the Same Result for Different Column Data Types

5 minute read

Published:

I encountered an issue when applying the crosstab function in PySpark to a fairly large dataset, and I think it should be considered a pretty big issue. Please note that this issue was observed on Sep 20, 2019; it might have been fixed since then.

cumulative distribution function

custom partitioner

custom stats sorters

Modifying the Code Profiler to Use Custom sort_stats Sorters

3 minute read

Published:

Code profiling is simply used to assess code performance, including that of its functions and the sub-functions within them. One of its obvious uses is code optimisation, where a developer wants to improve the code’s efficiency by searching for its bottlenecks.

data distribution

data dredging

Data Dredging (p-hacking)

3 minute read

Published:

“If you torture the data long enough, it will confess to anything” - Ronald Coase.

data quality

Data Quality with Apache Griffin Overview

4 minute read

Published:

A few days back I was exploring a big data quality tool called Griffin. There are lots of DQ tools out there, such as Deequ, Target’s data validator, TensorFlow data validator, PySpark Owl, and Great Expectations. There’s another one called Cerberus; however, it doesn’t natively support large-scale data.

data science

Hypothesis Testing with the Kruskal-Wallis Test

6 minute read

Published:

The Kruskal-Wallis test is a non-parametric statistical test that is used to evaluate whether the medians of two or more groups are different. Since the test is non-parametric, it doesn’t assume that the data comes from a particular distribution.
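
The test works on ranks rather than raw values, which is why no distribution is assumed. As a rough sketch, the H statistic can be computed with the standard library alone; the function name is hypothetical, tied values share their average rank, and the usual tie correction factor is deliberately left out to keep the example short.

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H statistic (without the tie correction factor)."""
    pooled = sorted(value for group in groups for value in group)
    n_total = len(pooled)

    # Assign each distinct value the average of the 1-based ranks it occupies.
    rank_of = {}
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        i = j

    # Sum over groups of (rank sum squared) / (group size).
    rank_term = sum(
        sum(rank_of[value] for value in group) ** 2 / len(group)
        for group in groups
    )
    return 12.0 / (n_total * (n_total + 1)) * rank_term - 3.0 * (n_total + 1)


h = kruskal_wallis_h([1, 2, 3], [4, 5, 6], [7, 8, 9])
```

Under the null hypothesis, H approximately follows a chi-squared distribution with (number of groups - 1) degrees of freedom, so a p-value would come from that distribution.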

data sorting

database

dataframe

Adding Strictly Increasing ID to Spark Dataframes

3 minute read

Published:

Recently I was exploring ways of adding a unique row ID column to a dataframe. The requirement is simple: “the row ID should strictly increase with a difference of one and the data order should not be modified”.

Crosstab Does Not Yield the Same Result for Different Column Data Types

5 minute read

Published:

I encountered an issue when applying the crosstab function in PySpark to a fairly large dataset, and I think it should be considered a pretty big issue. Please note that this issue was observed on Sep 20, 2019; it might have been fixed since then.

Sigma Operation in Spark’s Dataframe

1 minute read

Published:

Have you ever encountered a case where you need to compute the sum of a certain one-item operation? Consider the following example.

Union Operation After Left-anti Join Might Result in Inconsistent Attributes Data

3 minute read

Published:

Unioning two dataframes after joining them with left_anti? Well, seems like a straightforward approach. However, recently I encountered a case where the join operation might shift the position of the join key in the resulting dataframe. This, unfortunately, makes the merged dataframe inconsistent in terms of the data in each attribute.

Resolving Reference Column Ambiguity After Self-Joining by Deep Copying the Dataframes

5 minute read

Published:

I encountered an intriguing result when joining a dataframe with itself (self-join). As you might already know, one of the problems that occur when doing a self-join relates to duplicated column names. Because of this duplication, there’s an ambiguity when we do operations requiring us to provide the column names.

How to Check the Size of a Dataframe?

1 minute read

Published:

Have you ever wondered how the size of a dataframe can be discovered? Perhaps it doesn’t sound like a fancy thing to know, yet I think there are certain cases requiring us to know the size of our dataframe in advance. One of them is when we want to apply a broadcast operation. As you might already know, broadcasting requires the dataframe to be small enough to fit in the memory of each executor. This implicitly means that we should know the size of the dataframe beforehand in order for broadcasting to be applied successfully. Just FYI, Spark lets us configure the maximum size of a dataframe that can be broadcast to each executor. Precisely, this maximum size can be configured via spark.conf.set("spark.sql.autoBroadcastJoinThreshold", MAX_SIZE).

Effects of Shuffling on RDDs and Dataframes Partitioning

8 minute read

Published:

In Spark, data shuffling simply means data movement. On a single machine with multiple partitions, data shuffling means that data moves from one partition to another. Meanwhile, across multiple machines, data shuffling can involve two kinds of movement. The first is data moving from one partition (A) to another partition (B) within the same machine (M1), while the second is data moving from partition B to another partition (C) on a different machine (M2). Data in partition C might then be moved again to a partition on yet another machine (M3).

Speeding Up Parquet Write

3 minute read

Published:

Parquet is a columnar file format. Columnar means that the data is stored column by column rather than row by row. Here’s a simple example.

Too Lazy to Process the Whole Dataframe

2 minute read

Published:

One of the characteristics of Spark that makes me interested in exploring this framework further is its lazy evaluation approach. Simply put, Spark won’t execute the transformations until an action is called. I think it’s logical: when we only specify the transformation plan and don’t ask Spark to execute it, why would it need to force itself to do the computation on the data? In addition, by implementing this lazy evaluation approach, Spark might be able to optimize the logical plan, so the manual work of making the query more efficient might be reduced significantly. Cool, right?

Custom Partitioner for Repartitioning in Spark

8 minute read

Published:

A statement I encountered a few days ago: “Avoid using Resilient Distributed Datasets (RDDs) and use Dataframes/Datasets (DFs/DTs) instead, especially in the production stage”.

List of Spark Machine Learning Models & Non-overwritten Prediction Columns

2 minute read

Published:

I was implementing a paper related to balanced random forest (BRF). Just FYI, a BRF consists of several decision trees, where each tree receives instances with a 1:1 ratio of minority to majority class. A BRF also uses m randomly selected features to determine the best split.

david hilbert

The Infinite Hotel Paradox by David Hilbert

2 minute read

Published:

Recently I watched a YouTube video about the infinite hotel paradox, which was introduced in the 1920s by a German mathematician, David Hilbert. In case you’re curious about the video, just search on YouTube using the keyword “The Infinite Hotel Paradox”.

deep copy

Resolving Reference Column Ambiguity After Self-Joining by Deep Copying the Dataframes

5 minute read

Published:

I encountered an intriguing result when joining a dataframe with itself (self-join). As you might already know, one of the problems that occur when doing a self-join relates to duplicated column names. Because of this duplication, there’s an ambiguity when we do operations requiring us to provide the column names.

density estimation

density ratio

density ratio estimation

Tackling Covariate Shift in ML Using ML

2 minute read

Published:

In the previous post I mentioned a simple way of estimating the density ratio of two probability distributions. I decided to create a Python package that provides this functionality.

distributed

Pseudo-distributed LIME via PySpark UDF

2 minute read

Published:

The initial question that popped up in my mind was how to make LIME perform faster. This should be useful when the data to explain is big.

distributed computing

Lightning Fast Pandas UDF

5 minute read

Published:

Spark user-defined functions (UDFs) are simply functions created to overcome speed problems when you want to process a dataframe. They’re useful when your Python functions are too slow at processing a large dataframe. When you use a plain Python function, it processes the dataframe one row at a time, meaning that the process is executed sequentially. Meanwhile, if you use a Spark UDF, Spark distributes the dataframe and the UDF to the available executors, so the dataframe is processed in parallel. For more information about Spark UDFs, please take a look at this post.

distributed system

CAP Theorem

3 minute read

Published:

“Consistency, Availability, and Partition Tolerance” - choose two.

distribution

Maximum Likelihood Estimation

3 minute read

Published:

If in the probability context we state that P(x1, x2, ..., xn | params) means the probability of getting a set of observations x1, x2, …, and xn given the distribution parameters, then in the likelihood context we get the following.

docker

Apache Cassandra: Begins with Docker

2 minute read

Published:

This article is about how to install Cassandra and play with several of its query languages. To accomplish that, I’m going to utilize Docker.

dot character

dream

When GOD Granted That Opportunity: Part 1

8 minute read

Published:

On November 15th, 2018, I promised myself I would write down my journey of accomplishing one of my dreams. This post is the fulfillment of that promise.

driver status

duplicates

Retrieving Rows with Duplicate Values on the Columns of Interest in Spark

5 minute read

Published:

There are several ways of removing duplicate rows in Spark. Two of them are distinct() and dropDuplicates(). The former lets us remove rows with the same values in all columns. Meanwhile, the latter lets us remove rows with the same values in a selected subset of columns.

durability

dynamic state

Kalman Filter for Dynamic State & Multiple Measurements

2 minute read

Published:

In the previous post, we discussed the implementation of the Kalman filter for a static state (the true values of the object’s states are constant over time). In addition, the Kalman filter algorithm was applied to estimate a single true value.

empirical distribution

ensemble

euclid proof

Infinitely Many Prime Numbers by Euclid

1 minute read

Published:

To me, prime numbers are really interesting in terms of their position as the building blocks of other numbers. According to the Fundamental Theorem of Arithmetic, every integer N greater than 1 can be written as a product of primes P1, P2, P3, …, Pk.
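
The factorization the theorem guarantees can be found by trial division. The following is a small sketch for illustration; the function name `prime_factors` is an assumption of this example, and the method is only practical for modest values of N.

```python
def prime_factors(n):
    """Return the prime factors of n (with repetition) in ascending order."""
    if n < 2:
        raise ValueError("n must be an integer greater than 1")
    factors = []
    divisor = 2
    # Any composite n has a prime factor no larger than sqrt(n).
    while divisor * divisor <= n:
        while n % divisor == 0:
            factors.append(divisor)
            n //= divisor
        divisor += 1
    # Whatever remains above 1 is itself prime.
    if n > 1:
        factors.append(n)
    return factors


factors_of_360 = prime_factors(360)  # 360 = 2 * 2 * 2 * 3 * 3 * 5
```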

euler

euler identity

executor

explainable ai

explanation

external table

Permanent and Temporary External Table in BigQuery

6 minute read

Published:

In BigQuery, an external data source is a data source that we can query directly even though the data is not stored in BigQuery’s storage. We can query the data source just by creating an external table that refers to it instead of loading the data into BigQuery.

extra classpath

extreme algebra

extreme gradient boosting

false positive rate

find optimal hyperplane

functions

Sigma Operation in Spark’s Dataframe

1 minute read

Published:

Have you ever encountered a case where you need to compute the sum of a certain one-item operation? Consider the following example.

gcp

Permanent and Temporary External Table in BigQuery

6 minute read

Published:

In BigQuery, an external data source is a data source that we can query directly even though the data is not stored in BigQuery’s storage. We can query the data source just by creating an external table that refers to it instead of loading the data into BigQuery.

good night like yesterday

gradient boosting

griffin

Data Quality with Apache Griffin Overview

4 minute read

Published:

A few days back I was exploring a big data quality tool called Griffin. There are lots of DQ tools out there, such as Deequ, Target’s data validator, TensorFlow data validator, PySpark Owl, and Great Expectations. There’s another one called Cerberus; however, it doesn’t natively support large-scale data.

groupby

Retrieving Rows with Duplicate Values on the Columns of Interest in Spark

5 minute read

Published:

There are several ways of removing duplicate rows in Spark. Two of them are distinct() and dropDuplicates(). The former lets us remove rows with the same values in all columns. Meanwhile, the latter lets us remove rows with the same values in a selected subset of columns.

Streaming GroupBy for Large Datasets with Pandas

7 minute read

Published:

I came across an article about how to perform a groupBy operation on a large dataset. Long story short, the author proposes an approach called streaming groupBy, where the dataset is divided into chunks and the groupBy operation is applied to each chunk. This approach is implemented with pandas.

Crosstab Does Not Yield the Same Result for Different Column Data Types

5 minute read

Published:

I encountered an issue when applying the crosstab function in PySpark to a fairly large dataset, and I think it should be considered a pretty big issue. Please note that this issue was observed on Sep 20, 2019; it might have been fixed since then.

h2o

hadoop

Using SparkSQL in Metabase

5 minute read

Published:

Basically, Metabase’s SparkSQL only allows users to access data in the Hive warehouse. In other words, the data must be stored as Hive tables in order to be loaded.

hash

heap memory

history server

Spark History Server: Setting Up & How It Works

2 minute read

Published:

Application monitoring is critically important, especially when we encounter performance issues. In Spark, one way to monitor an application is via the Spark UI. The problem is that the Spark UI can only be accessed while the application is running.

hive

Using SparkSQL in Metabase

5 minute read

Published:

Basically, Metabase’s SparkSQL only allows users to access data in the Hive warehouse. In other words, the data must be stored as Hive tables in order to be loaded.

hi~

Level 2. Hi~

14 minute read

Published:

Primary purpose:

hyperparameter tuning

hypothesis testing

Hypothesis Testing with the Kruskal-Wallis Test

6 minute read

Published:

The Kruskal-Wallis test is a non-parametric statistical test that is used to evaluate whether the medians of two or more groups are different. Since the test is non-parametric, it doesn’t assume that the data comes from a particular distribution.

Sample Size is Matter for Mean Difference Testing

1 minute read

Published:

It’s quite bothersome to read a publication that only provides a “statistically significant” result without saying much about the analysis done prior to conducting the experiment.

Data Dredging (p-hacking)

3 minute read

Published:

“If you torture the data long enough, it will confess to anything” - Ronald Coase.

imblearn

Implementing Balanced Random Forest via imblearn

3 minute read

Published:

Have you ever heard of the imblearn package? Based on its name, I think people who are familiar with machine learning are going to presume that it’s a package specifically created for tackling the problem of imbalanced data. If you do a deeper search, you’re gonna find its GitHub repository here. And yes, once again, it’s a Python package for playing with imbalanced data.

imo

IMO 2012 Problem 2 - Solution

4 minute read

Published:

Let’s play with the 2nd problem of the International Mathematical Olympiad (IMO) 2012.

The Legendary Question Six IMO 1988

9 minute read

Published:

The final problem of the International Mathematical Olympiad (IMO) 1988 is considered to be the most difficult problem of the contest.

importance weighting

incremental query

Incremental Query for Large Streaming Data Operation

4 minute read

Published:

In the previous post, I wrote about how to perform a pandas groupBy operation on a large dataset in a streaming way. The main problem being addressed is optimizing memory consumption, since the data size might be extremely large.

inequation

IMO 2012 Problem 2 - Solution

4 minute read

Published:

Let’s play with the 2nd problem of the International Mathematical Olympiad (IMO) 2012.

infinite hotel

The Infinite Hotel Paradox by David Hilbert

2 minute read

Published:

Recently I watched a YouTube video about the infinite hotel paradox, which was introduced in the 1920s by a German mathematician, David Hilbert. In case you’re curious about the video, just search on YouTube using the keyword “The Infinite Hotel Paradox”.

infinite prime numbers

Infinitely Many Prime Numbers by Euclid

1 minute read

Published:

To me, prime numbers are really interesting in terms of their position as the building blocks of other numbers. According to the Fundamental Theorem of Arithmetic, every integer N greater than 1 can be written as a product of primes P1, P2, P3, …, Pk.

infinity

The Infinite Hotel Paradox by David Hilbert

2 minute read

Published:

Recently I watched a YouTube video about the infinite hotel paradox, which was introduced in the 1920s by a German mathematician, David Hilbert. In case you’re curious about the video, just search on YouTube using the keyword “The Infinite Hotel Paradox”.

information value

input data stream

Repartitioning Input Data Stream

3 minute read

Published:

Recently I played with a simple Spark Streaming application. Precisely, I investigated the behavior of repartitioning at different levels of input data streams. For instance, suppose we have two input data streams, linesDStream and wordsDStream. The question is: is the repartitioning result different if I repartition after linesDStream versus after wordsDStream?

interpretable

java

kafka

WTF is Kafka? A High-level Overview

7 minute read

Published:

Basically, you can think of Kafka as a messaging system. When an application sends a message to another application, one thing it needs to do is specify how to send the message. The most obvious use case for a messaging system, in my opinion, is when we’re dealing with big data. For instance, a sender application shares a large amount of data that needs to be processed by a receiver application. However, the receiver’s processing rate is lower than the sending rate. Consequently, the receiver might be overloaded since it’s unable to receive more messages while the processing is running. Even if we use distributed receivers, we still have to tell the sender which receiver node it should send the message to.

kalman filter

Kalman Filter for Dynamic State & Multiple Measurements

2 minute read

Published:

In the previous post, we discussed the implementation of the Kalman filter for a static state (the true values of the object’s states are constant over time). In addition, the Kalman filter algorithm was applied to estimate a single true value.

Kalman Filter for Static State & Single Measurement

1 minute read

Published:

The Kalman filter is an iterative mathematical process applied to consecutive data inputs to quickly estimate the true value (position, velocity, weight, temperature, etc.) of the object being measured when the measured values contain random error or uncertainty.

kerberos

The Three-Headed Hound of the Underworld (Kerberos)

6 minute read

Published:

Kerberos is simply a “ticket-based” authentication protocol. It enhances the security approach used by password-based authentication protocols. Since eavesdroppers might be able to capture the password, Kerberos mitigates this by leveraging a ticket (how it is generated is explained below) that ideally should only be known by the client and the service.

kl divergence

kolmogorov smirnov test

kruskal wallis

Hypothesis Testing with the Kruskal-Wallis Test

6 minute read

Published:

The Kruskal-Wallis test is a non-parametric statistical test that is used to evaluate whether the medians of two or more groups are different. Since the test is non-parametric, it doesn’t assume that the data comes from a particular distribution.

kullback-leibler

kurtosis

language models

lazy evaluation

Too Lazy to Process the Whole Dataframe

2 minute read

Published:

One of the characteristics of Spark that makes me interested in exploring this framework further is its lazy evaluation approach. Simply put, Spark won’t execute the transformations until an action is called. I think it’s logical: when we only specify the transformation plan and don’t ask Spark to execute it, why would it need to force itself to do the computation on the data? In addition, by implementing this lazy evaluation approach, Spark might be able to optimize the logical plan, so the manual work of making the query more efficient might be reduced significantly. Cool, right?

left anti join

Union Operation After Left-anti Join Might Result in Inconsistent Attributes Data

3 minute read

Published:

Unioning two dataframes after joining them with left_anti? Well, seems like a straightforward approach. However, recently I encountered a case where the join operation might shift the position of the join key in the resulting dataframe. This, unfortunately, makes the merged dataframe inconsistent in terms of the data in each attribute.

legendary question

The Legendary Question Six IMO 1988

9 minute read

Published:

The final problem of the International Mathematical Olympiad (IMO) 1988 is considered to be the most difficult problem of the contest.

levinson durbin

life journey

When GOD Granted That Opportunity: Part 1

8 minute read

Published:

On November 15th, 2018, I promised myself I would write down my journey of accomplishing one of my dreams. This post is the fulfillment of that promise.

lime

Pseudo-distributed LIME via PySpark UDF

2 minute read

Published:

The initial question that popped up in my mind was how to make LIME perform faster. This should be useful when the data to explain is big.

livy

local

logging

machine learning

Data Quality with Apache Griffin Overview

4 minute read

Published:

A few days back I was exploring a big data quality tool called Griffin. There are lots of DQ tools out there, such as Deequ, Target’s data validator, TensorFlow data validator, PySpark Owl, and Great Expectations. There’s another one called Cerberus; however, it doesn’t natively support large-scale data.

Standard Error of Mean Estimate Derivation

4 minute read

Published:

Suppose we conduct K experiments on a kind of measurement. In each experiment, we take N observations. In other words, we’ll have N * K data points at the end.

Tackling Covariate Shift in ML Using ML

2 minute read

Published:

In the previous post I mentioned a simple way of estimating the density ratio of two probability distributions. I decided to create a Python package that provides this functionality.

Implementing Balanced Random Forest via imblearn

3 minute read

Published:

Have you ever heard of the imblearn package? Based on its name, I think people who are familiar with machine learning are going to presume that it’s a package specifically created for tackling the problem of imbalanced data. If you do a deeper search, you’re gonna find its GitHub repository here. And yes, once again, it’s a Python package for playing with imbalanced data.

List of Spark Machine Learning Models & Non-overwritten Prediction Columns

2 minute read

Published:

I was implementing a paper related to balanced random forest (BRF). Just FYI, a BRF consists of several decision trees, where each tree receives instances with a 1:1 ratio of minority to majority class. A BRF also uses m randomly selected features to determine the best split.

machine translation

maclaurin series

mappartitions

Making mapPartitions Accepts Partition Functions with More Than One Arguments

1 minute read

Published:

There might be a case where we need to perform a certain operation on each data partition. One of the most common examples is the use of mapPartitions. Sometimes such an operation requires a more complicated procedure, so the function executing it ends up needing more than one parameter.

mathematics

The Legendary Question Six IMO 1988

9 minute read

Published:

The final problem of the International Mathematical Olympiad (IMO) 1988 is considered to be the most difficult problem of the contest.

The Infinite Hotel Paradox by David Hilbert

2 minute read

Published:

Recently I watched a YouTube video about the infinite hotel paradox, which was introduced in the 1920s by a German mathematician, David Hilbert. In case you’re curious about the video, just search on YouTube using the keyword “The Infinite Hotel Paradox”.

maths

IMO 2012 Problem 2 - Solution

4 minute read

Published:

Let’s play with the 2nd problem of the International Mathematics Olympiad (IMO) 2012.

Moment Generating Function

2 minute read

Published:

As the name suggests, moment generating function (MGF) provides a function that generates moments, such as E[X], E[X^2], E[X^3], and so forth.
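In standard notation (the symbols $M_X$ and $t$ are assumed here, not taken from the excerpt), the MGF and the way it generates moments can be written as:

```latex
\[
M_X(t) = E\left[e^{tX}\right],
\qquad
E\left[X^n\right] = \left.\frac{d^n}{dt^n} M_X(t)\right|_{t=0}
\]
```

So the first derivative at $t = 0$ gives E[X], the second gives E[X^2], and so on.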

One-sample Z-test with p-value Approach

8 minute read

Published:

The one-sample z-test is used to examine whether the difference between a population mean and a certain value is significant.
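As a rough illustration, the test statistic and a two-sided p-value can be computed with the standard library alone. This sketch assumes the population standard deviation `sigma` is known. The function name and the numbers are hypothetical.

```python
import math

def one_sample_z_test(sample_mean, mu0, sigma, n):
    """Return the z statistic and the two-sided p-value."""
    # Standard error of the sample mean, assuming a known population sigma.
    z = (sample_mean - mu0) / (sigma / math.sqrt(n))
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical numbers: 100 observations averaging 52, tested against 50.
z, p = one_sample_z_test(sample_mean=52.0, mu0=50.0, sigma=10.0, n=100)
```

Here z equals 2.0 and the p-value is roughly 0.0455, so the difference would be called significant at the 5% level but not at the 1% level.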

Riemann Hypothesis and One Question in My Mind

7 minute read

Published:

Yesterday I came across an interesting maths paper discussing the Riemann hypothesis. The concept itself involves lots of maths, but I enjoyed the reading. Frankly speaking, although mathematics is one of my favourite subjects, I’ve rarely played with it (esp. pure maths) since I got acquainted with the world of AI and big data engineering. Now I think it’s just fine to play with it again. Just for fun.

maximum likelihood estimation

Maximum Likelihood Estimation

3 minute read

Published:

If in the probability context we state that P(x1, x2, ..., xn | params) means the probability of getting a set of observations x1, x2, …, and xn given the distribution parameters, then in the likelihood context we get the following.
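In the usual notation, with $\theta$ standing for the distribution parameters (a symbol assumed here), the likelihood treats the same expression as a function of the parameters while the observations stay fixed. The product form below additionally assumes independent, identically distributed observations.

```latex
\[
\begin{aligned}
L(\theta \mid x_1, x_2, \ldots, x_n) &= P(x_1, x_2, \ldots, x_n \mid \theta) \\
&= \prod_{i=1}^{n} P(x_i \mid \theta) \quad \text{(i.i.d. assumption)} \\
\hat{\theta}_{\mathrm{MLE}} &= \arg\max_{\theta} \, L(\theta \mid x_1, x_2, \ldots, x_n)
\end{aligned}
\]
```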

mean

One-sample Z-test with p-value Approach

8 minute read

Published:

The one-sample z-test is used to examine whether the difference between a population mean and a certain value is significant.

message broker

WTF is Kafka? A High-level Overview

7 minute read

Published:

Basically, you can think of Kafka as a messaging system. When an application sends a message to another application, one thing it needs to do is specify how to send the message. The most obvious use case for a messaging system, in my opinion, is when we’re dealing with big data. For instance, a sender application shares a large amount of data that needs to be processed by a receiver application. However, the receiver’s processing rate is lower than the sending rate. Consequently, the receiver might be overloaded since it’s unable to receive more messages while the processing is running. Even if we use distributed receivers, we still have to tell the sender which receiver node it should send the message to.

metabase

Using SparkSQL in Metabase

5 minute read

Published:

Basically, Metabase’s SparkSQL only allows users to access data in the Hive warehouse. In other words, the data must be in Hive table format to be able to be loaded.

metadata

mle

Maximum Likelihood Estimation

3 minute read

Published:

If in the probability context we state that P(x1, x2, ..., xn | params) means the probability of getting a set of observations x1, x2, …, and xn given the distribution parameters, then in the likelihood context we get the following.

mmlspark

model agnostic

moment generating function

Moment Generating Function

2 minute read

Published:

As the name suggests, moment generating function (MGF) provides a function that generates moments, such as E[X], E[X^2], E[X^3], and so forth.

moments

Moment Generating Function

2 minute read

Published:

As the name suggests, moment generating function (MGF) provides a function that generates moments, such as E[X], E[X^2], E[X^3], and so forth.

mongodb

monitoring

Spark History Server: Setting Up & How It Works

2 minute read

Published:

Application monitoring is critically important, especially when we encounter performance issues. In Spark, one way to monitor a Spark application is via Spark UI. The problem is, this Spark UI can only be accessed when the application is running.

monotonic binning

monty hall

multicollinearity

multiple measurements

Kalman Filter for Dynamic State & Multiple Measurements

2 minute read

Published:

In the previous post, we discussed the implementation of the Kalman filter for a static state (the true values of the object’s states are constant over time). In addition, the Kalman filter algorithm was applied to estimate a single true value.

multiple queries

mysql

Using SparkSQL in Metabase

5 minute read

Published:

Basically, Metabase’s SparkSQL only allows users to access data in the Hive warehouse. In other words, the data must be in Hive table format to be able to be loaded.

natural language

natural language processing

nested column

nested cubic roots

non-parametric

Hypothesis Testing with the Kruskal-Wallis Test

6 minute read

Published:

The Kruskal-Wallis test is a non-parametric statistical test that is used to evaluate whether the medians of two or more groups are different. Since the test is non-parametric, it doesn’t assume that the data comes from a particular distribution.

normal distribution

nosql

Apache Cassandra: Begins with Docker

2 minute read

Published:

This article is about how to install Cassandra and play with several of its query languages. To accomplish that, I’m going to utilize Docker.

number theory

The Legendary Question Six IMO 1988

9 minute read

Published:

The final problem of the International Mathematics Olympiad (IMO) 1988 is considered to be the most difficult problem on the contest.

obfuscation

Obfuscation Modes in PyArmor

3 minute read

Published:

I think one of the unique features provided by PyArmor is that it lets users configure how the code is obfuscated.

Obfuscating Python Scripts with PyArmor

11 minute read

Published:

Basically, code obfuscation is a technique used to modify the source code so that it becomes difficult to understand but remains fully functional. The main objective is to protect intellectual property and prevent hackers from reverse engineering proprietary source code.

olympiad

The Legendary Question Six IMO 1988

9 minute read

Published:

The final problem of the International Mathematics Olympiad (IMO) 1988 is considered to be the most difficult problem on the contest.

oversampling

p-hacking

Data Dredging (p-hacking)

3 minute read

Published:

“If you torture the data long enough, it will confess to anything” - Ronald Coase.

p-value

One-sample Z-test with p-value Approach

8 minute read

Published:

The one-sample z-test is used to examine whether the difference between a population mean and a certain value is significant.

pandas

Streaming GroupBy for Large Datasets with Pandas

7 minute read

Published:

I came across an article about how to perform a groupBy operation on a large dataset. Long story short, the author proposes an approach called streaming groupBy, where the dataset is divided into chunks and the groupBy operation is applied to each chunk. This approach is implemented with pandas.

pandas udf

Lightning Fast Pandas UDF

5 minute read

Published:

Spark functions (UDFs) are simply functions created to overcome speed performance problems when you want to process a dataframe. They’re useful when your Python functions are too slow at processing a large-scale dataframe. When you use a Python function, it processes the dataframe one row at a time, meaning that the process is executed sequentially. Meanwhile, if you use a Spark UDF, Spark distributes the dataframe and the Spark UDF to the provided executors. Hence, the dataframe processing is executed in parallel. For more information about Spark UDFs, please take a look at this post.

paradox

The Infinite Hotel Paradox by David Hilbert

2 minute read

Published:

Recently I watched a YouTube video about the infinite hotel paradox, which was introduced in the 1920s by the German mathematician David Hilbert. In case you’re curious about the video, just search YouTube for “The Infinite Hotel Paradox”.

parquet

Speeding Up Parquet Write

3 minute read

Published:

Parquet is a file format with columnar style. Columnar style means that we don’t store the content of each row of the data. Here’s a simple example.

partition

partitionby

Speeding Up Parquet Write

3 minute read

Published:

Parquet is a file format with columnar style. Columnar style means that we don’t store the content of each row of the data. Here’s a simple example.

partitions

Effects of Shuffling on RDDs and Dataframes Partitioning

8 minute read

Published:

In Spark, data shuffling simply means data movement. On a single machine with multiple partitions, data shuffling means that data moves from one partition to another. Meanwhile, across multiple machines, data shuffling can involve two kinds of movement. The first is data moving from one partition (A) to another partition (B) within the same machine (M1), while the second is data moving from partition B to another partition (C) on a different machine (M2). Data in partition C might then be moved again to a partition on yet another machine (M3).

partitions after join

Ensuring Dataframe Partitions After Equi-joining (Inner)

6 minute read

Published:

The problem is really simple. After equi-joining (inner) two dataframes, a certain operation is applied to each partition. Precisely, such an operation can be accomplished by the following code:

parzen window

percentile

performance debugging

Spark History Server: Setting Up & How It Works

2 minute read

Published:

Application monitoring is critically important, especially when we encounter performance issues. In Spark, one way to monitor a Spark application is via Spark UI. The problem is, this Spark UI can only be accessed when the application is running.

pi

pivot

Crosstab Does Not Yield the Same Result for Different Column Data Types

5 minute read

Published:

I encountered an issue when applying the crosstab function in PySpark to a fairly large dataset. And I think this should be considered a pretty big issue. Please note that this issue was observed on Sep 20, 2019. It might have been resolved since then.

precision recall curve

prime numbers counting

Riemann Hypothesis and One Question in My Mind

7 minute read

Published:

Yesterday I came across an interesting maths paper discussing the Riemann hypothesis. The concept itself involves lots of maths, but I enjoyed the reading. Frankly speaking, although mathematics is one of my favourite subjects, I’ve rarely played with it (esp. pure maths) since I got acquainted with the world of AI and big data engineering. Now I think it’s just fine to play with it again. Just for fun.

probabilistic classification

Tackling Covariate Shift in ML Using ML

2 minute read

Published:

In the previous post I mentioned a simple way of estimating the density ratio of two probability distributions. I decided to create a Python package that provides this functionality.

probability

Maximum Likelihood Estimation

3 minute read

Published:

If in the probability context we state that P(x1, x2, ..., xn | params) means the probability of getting a set of observations x1, x2, …, and xn given the distribution parameters, then in the likelihood context we get the following.

producer

profiling

Modifying the Code Profiler to Use Custom sort_stats Sorters

3 minute read

Published:

Code profiling is simply used to assess code performance, including its functions and the sub-functions within them. One of its obvious uses is code optimisation, where a developer wants to improve the code’s efficiency by searching for its bottlenecks.

pyarmor

Obfuscation Modes in PyArmor

3 minute read

Published:

I think one of the unique features provided by PyArmor is that it lets users configure how the code is obfuscated.

Obfuscating Python Scripts with PyArmor

11 minute read

Published:

Basically, code obfuscation is a technique used to modify the source code so that it becomes difficult to understand but remains fully functional. The main objective is to protect intellectual property and prevent hackers from reverse engineering proprietary source code.

pyspark

Adding Strictly Increasing ID to Spark Dataframes

3 minute read

Published:

Recently I was exploring ways of adding a unique row ID column to a dataframe. The requirement is simple: “the row ID should strictly increase with difference of one and the data order is not modified”.

Crosstab Does Not Yield the Same Result for Different Column Data Types

5 minute read

Published:

I encountered an issue when applying the crosstab function in PySpark to a fairly large dataset. And I think this should be considered a pretty big issue. Please note that this issue was observed on Sep 20, 2019. It might have been resolved since then.

python

Obfuscation Modes in PyArmor

3 minute read

Published:

I think one of the unique features provided by PyArmor is that it lets users configure how the code is obfuscated.

Obfuscating Python Scripts with PyArmor

11 minute read

Published:

Basically, code obfuscation is a technique used to modify the source code so that it becomes difficult to understand but remains fully functional. The main objective is to protect intellectual property and prevent hackers from reverse engineering proprietary source code.

Setting Up & Debugging Airflow On Local Machine

5 minute read

Published:

Airflow is basically a workflow management system. When we’re talking about a “workflow”, we’re referring to a sequence of tasks that need to be performed to accomplish a certain goal. A simple example would be an ordinary ETL job, such as fetching data from data sources, transforming the data into formats that accord with the requirements, and then storing the transformed data in a data warehouse.

Making mapPartitions Accepts Partition Functions with More Than One Arguments

1 minute read

Published:

There might be a case where we need to perform a certain operation on each data partition. One of the most common examples is the use of mapPartitions. Sometimes, such an operation requires a more complicated procedure. As a result, the method executing the operation needs more than one parameter.

Sigma Operation in Spark’s Dataframe

1 minute read

Published:

Have you ever encountered a case where you need to compute the sum of a certain one-item operation? Consider the following example.

Modifying the Code Profiler to Use Custom sort_stats Sorters

3 minute read

Published:

Code profiling is simply used to assess code performance, including its functions and the sub-functions within them. One of its obvious uses is code optimisation, where a developer wants to improve the code’s efficiency by searching for its bottlenecks.

python programming language

ramanujan

rdd

Making mapPartitions Accepts Partition Functions with More Than One Arguments

1 minute read

Published:

There might be a case where we need to perform a certain operation on each data partition. One of the most common examples is the use of mapPartitions. Sometimes, such an operation requires a more complicated procedure. As a result, the method executing the operation needs more than one parameter.

Effects of Shuffling on RDDs and Dataframes Partitioning

8 minute read

Published:

In Spark, data shuffling simply means data movement. On a single machine with multiple partitions, data shuffling means that data moves from one partition to another. Meanwhile, across multiple machines, data shuffling can involve two kinds of movement. The first is data moving from one partition (A) to another partition (B) within the same machine (M1), while the second is data moving from partition B to another partition (C) on a different machine (M2). Data in partition C might then be moved again to a partition on yet another machine (M3).

Custom Partitioner for Repartitioning in Spark

8 minute read

Published:

A statement I encountered a few days ago: “Avoid to use Resilient Distributed Datasets (RDDs) and use Dataframes/Datasets (DFs/DTs) instead, especially in production stage”.

recommendation system

recursion

regression

repartition

repartitioning

Repartitioning Input Data Stream

3 minute read

Published:

Recently I played with a simple Spark Streaming application. Precisely, I investigated the behavior of repartitioning at different levels of input data streams. For instance, suppose we have two input data streams, linesDStream and wordsDStream. The question is: does the repartitioning result differ depending on whether I repartition after linesDStream or after wordsDStream?

Custom Partitioner for Repartitioning in Spark

8 minute read

Published:

A statement I encountered a few days ago: “Avoid to use Resilient Distributed Datasets (RDDs) and use Dataframes/Datasets (DFs/DTs) instead, especially in production stage”.

resources

riemann hypothesis

Riemann Hypothesis and One Question in My Mind

7 minute read

Published:

Yesterday I came across an interesting maths paper discussing the Riemann hypothesis. The concept itself involves lots of maths, but I enjoyed the reading. Frankly speaking, although mathematics is one of my favourite subjects, I’ve rarely played with it (esp. pure maths) since I got acquainted with the world of AI and big data engineering. Now I think it’s just fine to play with it again. Just for fun.

roc curve

rolling file appender

rule-based nlp

rust

sample size

Sample Size is Matter for Mean Difference Testing

1 minute read

Published:

It’s quite bothersome to read a publication that only provides a “statistically significant” result without telling much about the analysis done prior to conducting the experiment.

scala

Retrieving Rows with Duplicate Values on the Columns of Interest in Spark

5 minute read

Published:

There are several ways of removing duplicate rows in Spark. Two of them are distinct() and dropDuplicates(). The former lets us remove rows with the same values on all the columns. Meanwhile, the latter lets us remove rows with the same values on multiple selected columns.

scheduler

scheduling

security

The Three-Headed Hound of the Underworld (Kerberos)

6 minute read

Published:

Kerberos is simply a “ticket-based” authentication protocol. It enhances the security approach used by password-based authentication protocol. Since there might be a possibility for tappers to take over the password, Kerberos mitigates this by leveraging a ticket (how it is generated is explained below) that ideally should only be known by the client and the service.

Obfuscating Python Scripts with PyArmor

11 minute read

Published:

Basically, code obfuscation is a technique used to modify the source code so that it becomes difficult to understand but remains fully functional. The main objective is to protect intellectual property and prevent hackers from reverse engineering proprietary source code.

selfjoin

Resolving Reference Column Ambiguity After Self-Joining by Deep Copying the Dataframes

5 minute read

Published:

I encountered an intriguing result when joining a dataframe with itself (self-join). As you might already know, one of the problems that occur when doing a self-join relates to duplicated column names. Because of this duplication, there’s ambiguity when we perform operations that require us to provide column names.

sentiment analysis

sessions

shooting star

shuffle

Effects of Shuffling on RDDs and Dataframes Partitioning

8 minute read

Published:

In Spark, data shuffling simply means data movement. On a single machine with multiple partitions, data shuffling means that data moves from one partition to another. Meanwhile, across multiple machines, data shuffling can involve two kinds of movement. The first is data moving from one partition (A) to another partition (B) within the same machine (M1), while the second is data moving from partition B to another partition (C) on a different machine (M2). Data in partition C might then be moved again to a partition on yet another machine (M3).

sigma operation

Sigma Operation in Spark’s Dataframe

1 minute read

Published:

Have you ever encountered a case where you need to compute the sum of a certain one-item operation? Consider the following example.

single measurement

Kalman Filter for Static State & Single Measurement

1 minute read

Published:

The Kalman filter is an iterative mathematical process applied to consecutive data inputs to quickly estimate the true value (position, velocity, weight, temperature, etc.) of the object being measured when the measured values contain random error or uncertainty.
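The iteration for a static state with a single measurement per step can be sketched as three updates: compute the gain, move the estimate toward the measurement, and shrink the estimate error. This is a minimal one-dimensional version under assumed inputs. The function name, the parameter names, and the numbers are hypothetical, and the "error" terms are treated as variance-like quantities.

```python
def kalman_static(measurements, initial_estimate, initial_error, measurement_error):
    """Hypothetical 1-D Kalman iteration for a value that does not change."""
    estimate, error = initial_estimate, initial_error
    history = []
    for z in measurements:
        # Gain: how much to trust the new measurement over the current estimate.
        gain = error / (error + measurement_error)
        # Move the estimate toward the measurement by a fraction given by the gain.
        estimate = estimate + gain * (z - estimate)
        # The estimate error shrinks after every measurement.
        error = (1 - gain) * error
        history.append(estimate)
    return history

# Made-up readings of a quantity whose true value is assumed constant.
estimates = kalman_static(
    [75.0, 71.0, 70.0, 74.0],
    initial_estimate=68.0,
    initial_error=2.0,
    measurement_error=4.0,
)
```

Because the estimate error keeps shrinking, the gain drops with each step, so later measurements move the estimate less than earlier ones.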

size in disk

How to Check the Size of a Dataframe?

1 minute read

Published:

Have you ever wondered how the size of a dataframe can be discovered? Perhaps it doesn’t sound like such a fancy thing to know, yet I think there are certain cases that require us to know the size of our dataframe in advance. One of them is when we want to apply a broadcast operation. As you might already know, broadcasting requires the dataframe to be small enough to fit in memory on each executor. This implicitly means that we should know the size of the dataframe beforehand in order for broadcasting to be applied successfully. Just FYI, Spark lets us configure the maximum size of a dataframe that can be pushed to each executor. Precisely, this maximum size can be configured via spark.conf.set("spark.sql.autoBroadcastJoinThreshold", MAX_SIZE).

size in memory

How to Check the Size of a Dataframe?

1 minute read

Published:

Have you ever wondered how the size of a dataframe can be discovered? Perhaps it doesn’t sound like such a fancy thing to know, yet I think there are certain cases that require us to know the size of our dataframe in advance. One of them is when we want to apply a broadcast operation. As you might already know, broadcasting requires the dataframe to be small enough to fit in memory on each executor. This implicitly means that we should know the size of the dataframe beforehand in order for broadcasting to be applied successfully. Just FYI, Spark lets us configure the maximum size of a dataframe that can be pushed to each executor. Precisely, this maximum size can be configured via spark.conf.set("spark.sql.autoBroadcastJoinThreshold", MAX_SIZE).

skewness

smote

spark

Retrieving Rows with Duplicate Values on the Columns of Interest in Spark

5 minute read

Published:

There are several ways of removing duplicate rows in Spark. Two of them are distinct() and dropDuplicates(). The former lets us remove rows with the same values on all the columns. Meanwhile, the latter lets us remove rows with the same values on multiple selected columns.

Pseudo-distributed LIME via PySpark UDF

2 minute read

Published:

The initial question that popped up in my mind was how to make LIME perform faster. This should be useful when the data to explain is big enough.

Spark History Server: Setting Up & How It Works

2 minute read

Published:

Application monitoring is critically important, especially when we encounter performance issues. In Spark, one way to monitor a Spark application is via Spark UI. The problem is, this Spark UI can only be accessed when the application is running.

Making mapPartitions Accepts Partition Functions with More Than One Arguments

1 minute read

Published:

There might be a case where we need to perform a certain operation on each data partition. One of the most common examples is the use of mapPartitions. Sometimes, such an operation requires a more complicated procedure. As a result, the method executing the operation needs more than one parameter.

Sigma Operation in Spark’s Dataframe

1 minute read

Published:

Have you ever encountered a case where you need to compute the sum of a certain one-item operation? Consider the following example.

Modifying the Code Profiler to Use Custom sort_stats Sorters

3 minute read

Published:

Code profiling is simply used to assess code performance, including its functions and the sub-functions within them. One of its obvious uses is code optimisation, where a developer wants to improve the code’s efficiency by searching for its bottlenecks.

Union Operation After Left-anti Join Might Result in Inconsistent Attributes Data

3 minute read

Published:

Unioning two dataframes after joining them with left_anti? Well, it seems like a straightforward approach. However, recently I encountered a case where the join operation might shift the location of the join key in the resulting dataframe. This, unfortunately, makes the merged dataframe inconsistent in terms of the data in each attribute.

Resolving Reference Column Ambiguity After Self-Joining by Deep Copying the Dataframes

5 minute read

Published:

I encountered an intriguing result when joining a dataframe with itself (self-join). As you might already know, one of the problems that occur when doing a self-join relates to duplicated column names. Because of this duplication, there’s ambiguity when we perform operations that require us to provide column names.

How to Check the Size of a Dataframe?

1 minute read

Published:

Have you ever wondered how the size of a dataframe can be discovered? Perhaps it doesn’t sound like such a fancy thing to know, yet I think there are certain cases that require us to know the size of our dataframe in advance. One of them is when we want to apply a broadcast operation. As you might already know, broadcasting requires the dataframe to be small enough to fit in memory on each executor. This implicitly means that we should know the size of the dataframe beforehand in order for broadcasting to be applied successfully. Just FYI, Spark lets us configure the maximum size of a dataframe that can be pushed to each executor. Precisely, this maximum size can be configured via spark.conf.set("spark.sql.autoBroadcastJoinThreshold", MAX_SIZE).

Effects of Shuffling on RDDs and Dataframes Partitioning

8 minute read

Published:

In Spark, data shuffling simply means data movement. On a single machine with multiple partitions, data shuffling means that data moves from one partition to another. Meanwhile, across multiple machines, data shuffling can involve two kinds of movement. The first is data moving from one partition (A) to another partition (B) within the same machine (M1), while the second is data moving from partition B to another partition (C) on a different machine (M2). Data in partition C might then be moved again to a partition on yet another machine (M3).

Speeding Up Parquet Write

3 minute read

Published:

Parquet is a file format with columnar style. Columnar style means that we don’t store the content of each row of the data. Here’s a simple example.

Too Lazy to Process the Whole Dataframe

2 minute read

Published:

One of the characteristics of Spark that makes me interested in exploring this framework further is its lazy evaluation approach. Simply put, Spark won’t execute the transformations until an action is called. I think that’s logical: if we only specify the transformation plan and don’t ask Spark to execute it, why should it force itself to do the computation on the data? In addition, by implementing lazy evaluation, Spark might be able to optimize the logical plan. The task of manually making the query more efficient might be reduced significantly. Cool, right?

Lightning Fast Pandas UDF

5 minute read

Published:

Spark functions (UDFs) are simply functions created to overcome speed performance problems when you want to process a dataframe. They’re useful when your Python functions are too slow at processing a large-scale dataframe. When you use a Python function, it processes the dataframe one row at a time, meaning that the process is executed sequentially. Meanwhile, if you use a Spark UDF, Spark distributes the dataframe and the Spark UDF to the provided executors. Hence, the dataframe processing is executed in parallel. For more information about Spark UDFs, please take a look at this post.

Custom Partitioner for Repartitioning in Spark

8 minute read

Published:

A statement I encountered a few days ago: “Avoid to use Resilient Distributed Datasets (RDDs) and use Dataframes/Datasets (DFs/DTs) instead, especially in production stage”.

List of Spark Machine Learning Models & Non-overwritten Prediction Columns

2 minute read

Published:

I was implementing a paper related to balanced random forest (BRF). Just FYI, a BRF consists of some decision trees where each tree receives instances with a ratio of 1:1 for minority and majority class. A BRF also uses m features selected randomly to determine the best split.

Spark Accumulator

1 minute read

Published:

A few days ago I conducted a little experiment on Spark’s RDD operations. One of them was the foreach operation (categorized as an action). Simply put, this operation is applied to each row in the RDD, and the operation applied is specified via a function. Here’s a simple example:

spark sql

Using SparkSQL in Metabase

5 minute read

Published:

Basically, Metabase’s SparkSQL only allows users to access data in the Hive warehouse. In other words, the data must be in Hive table format to be able to be loaded.

Ensuring Dataframe Partitions After Equi-joining (Inner)

6 minute read

Published:

The problem is really simple. After equi-joining (inner) two dataframes, a certain operation is applied to each partition. Precisely, such an operation can be accomplished by the following code:

spark streaming

Repartitioning Input Data Stream

3 minute read

Published:

Recently I played with a simple Spark Streaming application. Precisely, I investigated the behavior of repartitioning at different levels of input data streams. For instance, suppose we have two input data streams, linesDStream and wordsDStream. The question is: does the repartitioning result differ depending on whether I repartition after linesDStream or after wordsDStream?

sqoop

Using SparkSQL in Metabase

5 minute read

Published:

Basically, Metabase’s SparkSQL only allows users to access data in the Hive warehouse. In other words, the data must be in Hive table format to be able to be loaded.

stack frame

Stack Frame

5 minute read

Published:

To discuss the stack frame, we’ll look at it from the Assembly language point of view.

stack memory

standalone mode

standard error of mean

Standard Error of Mean Estimate Derivation

4 minute read

Published:

Suppose we conduct K experiments on a kind of measurement. On each experiment, we take N observations. In other words, we’ll have N * K data at the end.
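Under the assumption that the observations are independent with common variance $\sigma^2$ (symbols assumed here), the spread of the K sample means follows from the variance of a single mean of N observations:

```latex
\[
\operatorname{Var}\left(\bar{X}\right)
= \operatorname{Var}\left(\frac{1}{N}\sum_{i=1}^{N} X_i\right)
= \frac{1}{N^2}\sum_{i=1}^{N}\operatorname{Var}(X_i)
= \frac{\sigma^2}{N},
\qquad
\mathrm{SE}\left(\bar{X}\right) = \frac{\sigma}{\sqrt{N}}
\]
```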

static model

Kalman Filter for Static State & Single Measurement

1 minute read

Published:

The Kalman filter is an iterative mathematical process applied to consecutive data inputs to quickly estimate the true value (position, velocity, weight, temperature, etc.) of the object being measured when the measured values contain random error or uncertainty.

statistic

statistical machine translation

statistical test

One-sample Z-test with p-value Approach

8 minute read

Published:

The one-sample z-test is used to examine whether the difference between a population mean and a certain value is significant.

statistics

White Noise Time Series

1 minute read

Published:

A white noise series has the following properties:

  • Mean equal to zero
  • Standard deviation is constant
  • Correlation between lags (lag > 0) is close to zero (each autocorrelation lies within the bound which shows no statistically significant difference from zero)
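The properties listed above can be checked on simulated data. The sketch below generates Gaussian noise with the standard library and computes the sample autocorrelation for a few lags. The helper `autocorrelation`, the seed, and the series length are hypothetical choices, and the bound 1.96 / sqrt(n) is the usual approximate 95% band, assumed here rather than taken from the post.

```python
import math
import random

def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given positive lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    numer = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return numer / denom

random.seed(42)
noise = [random.gauss(0.0, 1.0) for _ in range(2000)]

mean = sum(noise) / len(noise)
# Approximate 95% band for the autocorrelation of a white noise series.
bound = 1.96 / math.sqrt(len(noise))
acf = [autocorrelation(noise, lag) for lag in range(1, 6)]
```

For a white noise series, `mean` should sit near zero and most values in `acf` should fall inside `[-bound, bound]`, although roughly one lag in twenty can land outside the band by chance.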

Hypothesis Testing with the Kruskal-Wallis Test

6 minute read

Published:

The Kruskal-Wallis test is a non-parametric statistical test that is used to evaluate whether the medians of two or more groups are different. Since the test is non-parametric, it doesn’t assume that the data comes from a particular distribution.

Kalman Filter for Dynamic State & Multiple Measurements

2 minute read

Published:

In the previous post, we discussed the implementation of the Kalman filter for a static state (the true values of the object’s states are constant over time). In addition, the Kalman filter algorithm was applied to estimate a single true value.

Kalman Filter for Static State & Single Measurement

1 minute read

Published:

The Kalman filter is an iterative mathematical process applied to consecutive data inputs to quickly estimate the true value (position, velocity, weight, temperature, etc.) of the object being measured when the measured values contain random error or uncertainty.

Sample Size is Matter for Mean Difference Testing

1 minute read

Published:

It’s quite bothersome to read a publication that only reports a “statistically significant” result without saying much about the analysis done prior to conducting the experiment.

Data Dredging (p-hacking)

3 minute read

Published:

“If you torture the data long enough, it will confess to anything” - Ronald Coase.

Moment Generating Function

2 minute read

Published:

As the name suggests, the moment generating function (MGF) is a function that generates moments, such as E[X], E[X^2], E[X^3], and so forth.
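
As a quick numerical illustration, the sketch below uses the standard definition M(t) = E[e^(tX)] for a fair six-sided die (an example chosen here, not taken from the post) and recovers E[X] and E[X^2] as the first and second derivatives of M at t = 0, approximated with central finite differences.

```python
from math import exp


def mgf_fair_die(t):
    """M(t) = E[exp(t * X)] for a fair six-sided die."""
    return sum(exp(t * k) for k in range(1, 7)) / 6.0


def derivative_at_zero(f, order, h=1e-3):
    """Central finite-difference approximation of f'(0) or f''(0)."""
    if order == 1:
        return (f(h) - f(-h)) / (2.0 * h)
    if order == 2:
        return (f(h) - 2.0 * f(0.0) + f(-h)) / (h * h)
    raise ValueError("only order 1 and 2 are supported in this sketch")


first_moment = derivative_at_zero(mgf_fair_die, order=1)   # approximates E[X]
second_moment = derivative_at_zero(mgf_fair_die, order=2)  # approximates E[X^2]

print(f"E[X]   ≈ {first_moment:.4f}")
print(f"E[X^2] ≈ {second_moment:.4f}")
```

The exact values for a fair die are E[X] = 3.5 and E[X^2] = 91/6, so the approximations can be checked by hand.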

One-sample Z-test with p-value Approach

8 minute read

Published:

The one-sample z-test is used to examine whether the difference between a population mean and a certain value is significant.

Maximum Likelihood Estimation

3 minute read

Published:

If in the probability context we state that P(x1, x2, ..., xn | params) means the probability of getting a set of observations x1, x2, …, and xn given the distribution parameters, then in the likelihood context we get the following.

Standard Error of Mean Estimate Derivation

4 minute read

Published:

Suppose we conduct K experiments on a kind of measurement. In each experiment, we take N observations. In other words, we’ll have N * K data points at the end.

status tracking

stopiteration

stream processing

WTF is Kafka? A High-level Overview

7 minute read

Published:

Basically, you can think of Kafka as a messaging system. When an application sends a message to another application, one thing it needs to do is specify how to send the message. The most obvious use case for a messaging system, in my opinion, is when we’re dealing with big data. For instance, a sender application shares a large amount of data that needs to be processed by a receiver application. However, the receiver’s processing rate is lower than the sending rate. Consequently, the receiver might be overloaded since it’s unable to receive more messages while the processing is running. Even if we use distributed receivers, we still have to tell the sender which receiver node it should send the message to.

streaming

Incremental Query for Large Streaming Data Operation

4 minute read

Published:

In the previous post, I wrote about how to perform a pandas groupBy operation on a large dataset in a streaming way. The main problem being addressed is optimizing memory consumption, since the data size might be extremely large.

Streaming GroupBy for Large Datasets with Pandas

7 minute read

Published:

I came across an article about how to perform a groupBy operation on a large dataset. Long story short, the author proposes an approach called streaming groupBy, where the dataset is divided into chunks and the groupBy operation is applied to each chunk. This approach is implemented with pandas.
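
The chunking idea can be sketched without any third-party package. The example below uses plain Python instead of pandas, and it assumes a sum aggregation whose per-chunk results can simply be added together. The function names and sample rows are hypothetical, and this is not necessarily how the article implements it.

```python
from collections import defaultdict
from itertools import islice


def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any iterable."""
    iterator = iter(rows)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def streaming_group_sum(rows, chunk_size):
    """Sum values per key while holding only one chunk in memory at a time."""
    totals = defaultdict(float)
    for chunk in iter_chunks(rows, chunk_size):
        # Group-by within the current chunk only.
        partial = defaultdict(float)
        for key, value in chunk:
            partial[key] += value
        # Merge the chunk's partial sums into the running totals.
        for key, subtotal in partial.items():
            totals[key] += subtotal
    return dict(totals)


# Hypothetical (key, value) rows; in practice these would be read lazily from a file.
rows = [("a", 1.0), ("b", 2.0), ("a", 3.0), ("c", 4.0), ("b", 5.0)]
result = streaming_group_sum(rows, chunk_size=2)
print(result)
```

Memory use here depends on the chunk size and the number of distinct keys rather than on the total number of rows. Aggregations that cannot be merged this way, such as an exact median, need a different strategy.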

structured streaming

support vector machine

tabular data

task instance

threshold

time series

White Noise Time Series

1 minute read

Published:

A white noise series has the following properties:

  • The mean equals zero
  • The standard deviation is constant
  • The autocorrelation at every lag greater than zero is close to zero (each autocorrelation lies within the bounds that indicate no statistically significant difference from zero)

toeplitz

topic

tree structured parzen estimator

triple roots

true positive rate

twitter

udf

Pseudo-distributed LIME via PySpark UDF

2 minute read

Published:

The initial question that popped up in my mind was how to make LIME perform faster. This should be useful when the data to explain is large.

uncertainty

Kalman Filter for Dynamic State & Multiple Measurements

2 minute read

Published:

In the previous post, we discussed the implementation of the Kalman filter for a static state (the true values of the object’s states are constant over time). In addition, the Kalman filter algorithm was applied to estimate a single true value.

Kalman Filter for Static State & Single Measurement

1 minute read

Published:

The Kalman filter is an iterative mathematical process applied to consecutive data inputs to quickly estimate the true value (position, velocity, weight, temperature, etc.) of the object being measured when the measured values contain random error or uncertainty.

union

Union Operation After Left-anti Join Might Result in Inconsistent Attributes Data

3 minute read

Published:

Unioning two dataframes after joining them with left_anti? Well, it seems like a straightforward approach. However, I recently encountered a case where the join operation might shift the position of the join key in the resulting dataframe. This, unfortunately, makes the result of merging the dataframes inconsistent in terms of the data in each attribute.

union by name

unique row id

Adding Strictly Increasing ID to Spark Dataframes

3 minute read

Published:

Recently I was exploring ways of adding a unique row ID column to a dataframe. The requirement is simple: “the row ID should strictly increase with a difference of one and the data order is not modified”.

variance

vieta

vieta product

wallis product

weight of evidence

weka

white noise

White Noise Time Series

1 minute read

Published:

A white noise series has the following properties:

  • The mean equals zero
  • The standard deviation is constant
  • The autocorrelation at every lag greater than zero is close to zero (each autocorrelation lies within the bounds that indicate no statistically significant difference from zero)

window

Retrieving Rows with Duplicate Values on the Columns of Interest in Spark

5 minute read

Published:

There are several ways of removing duplicate rows in Spark. Two of them are distinct() and dropDuplicates(). The former lets us remove rows with the same values in all columns. Meanwhile, the latter lets us remove rows with the same values in a selected subset of columns.

window function

workers

workflow management

Setting Up & Debugging Airflow On Local Machine

5 minute read

Published:

Airflow is basically a workflow management system. When we’re talking about “workflow”, we’re referring to a sequence of tasks that need to be performed to accomplish a certain goal. A simple example would be an ordinary ETL job, such as fetching data from data sources, transforming the data into formats that meet the requirements, and then storing the transformed data in a data warehouse.

xgboost

z-score

One-sample Z-test with p-value Approach

8 minute read

Published:

The one-sample z-test is used to examine whether the difference between a population mean and a certain value is significant.

z-test

Sample Size is Matter for Mean Difference Testing

1 minute read

Published:

It’s quite bothersome to read a publication that only reports a “statistically significant” result without saying much about the analysis done prior to conducting the experiment.

One-sample Z-test with p-value Approach

8 minute read

Published:

The one-sample z-test is used to examine whether the difference between a population mean and a certain value is significant.

zeta function

Riemann Hypothesis and One Question in My Mind

7 minute read

Published:

Yesterday I came across an interesting math paper discussing the Riemann hypothesis. Regarding the concept itself, there’s lots of maths, but I think I enjoyed the reading. Frankly speaking, although mathematics is one of my favourite subjects, I’ve rarely played with it (esp. pure maths) since I got acquainted with the world of AI and big data engineering. Now I think it’s just fine to play with it again. Just for fun.