About Me

I am looking for internships for both this year and next year. Feel free to contact me.

I am a third-year Ph.D. student in the Graduate School of Knowledge Service Engineering (Department of Industrial & Systems Engineering) at the Korea Advanced Institute of Science and Technology (KAIST). My advisor is Prof. Jae-Gil Lee, and I am a member of the Data Mining Lab.

My general research interests lie in improving the performance of machine learning (ML) techniques in real-world scenarios. I am particularly interested in designing more advanced approaches to handle (i) large-scale data (previous research) and (ii) noisy data (current research), the two main real-world challenges that hinder the practical use of ML approaches.
[research statement]

Previous Research: ML on Large-scale Data

As the amount of data increases rapidly, many ML algorithms have achieved remarkable performance in numerous tasks such as document categorization and image classification. However, their extremely high computational cost on large-scale data makes them infeasible in real-world settings. To address this, many researchers have approximately decomposed an algorithm into smaller sub-tasks and run them in a distributed environment such as Hadoop or Spark. This approach greatly improves efficiency, but it still suffers from the following limitations:

L1: Accuracy Degradation: Because most ML algorithms were designed to run on a single machine, it is not trivial to decompose them for parallelization. Thus, many studies divided the entire data set into multiple partitions and simply applied the algorithm to each partition in parallel, without any guarantee of accuracy.

L2: Load Imbalance: In ML algorithms such as DBSCAN, neighboring objects must be assigned to the same data partition for parallel processing, so that the density of each object's neighborhood can be calculated. That is, the entire data set is divided into multiple contiguous sub-regions. However, such a region-based partitioning scheme causes a load imbalance problem, because the data distribution across sub-regions tends to be highly skewed in real-world data. In the MapReduce paradigm, the execution time is determined by the slowest worker, so balancing the load across data partitions is a very challenging problem.
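The load imbalance described in L2 can be illustrated with a minimal sketch. It assumes one-dimensional points, equal-width contiguous sub-regions, and a per-partition cost proportional to the number of points it holds. All of these are simplifications chosen for illustration; the function names and the synthetic data are hypothetical and are not taken from my actual work.

```python
import random


def region_partition_loads(points, num_partitions, lo=0.0, hi=1.0):
    """Count how many points fall into each equal-width contiguous sub-region of [lo, hi)."""
    width = (hi - lo) / num_partitions
    loads = [0] * num_partitions
    for x in points:
        # Clamp the index so a point exactly at the upper edge stays in the last region.
        idx = min(int((x - lo) / width), num_partitions - 1)
        loads[idx] += 1
    return loads


def imbalance_ratio(loads):
    """Load of the slowest worker divided by the load under a perfectly even split."""
    return max(loads) / (sum(loads) / len(loads))


rng = random.Random(0)

# Hypothetical skewed data: 90% of the points form one dense cluster near 0.1,
# and the remaining 10% are spread uniformly over [0, 1).
points = [min(max(rng.gauss(0.1, 0.02), 0.0), 0.999999) for _ in range(9000)]
points += [rng.random() for _ in range(1000)]

loads = region_partition_loads(points, num_partitions=4)
print("points per partition:", loads)
print("slowest worker vs. ideal:", round(imbalance_ratio(loads), 2))
```

Because the dense cluster falls entirely inside the first sub-region, one worker receives most of the data and the whole job waits for it, even though every sub-region covers the same amount of space.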

My previous work focuses on reinterpreting widely used machine learning algorithms from the perspective of distributed computing and on resolving the above limitations to improve their usability.

Current Research: ML on Noisy Data

In standard supervised learning, the labels of training data are assumed to be correct, but in real-world settings they may not be, because the labeling process is costly and time-consuming. Such noisy labels lead to poor performance of supervised ML algorithms. In particular, owing to their high capacity to fit any noisy labels, deep neural networks are known to be extremely vulnerable to label noise. My recent work focuses on training deep neural networks more robustly on data with label noise.


Email: songhwanjun@kaist.ac.kr

WWW: https://songhwanjun.github.io

Address: Room 1217, Bldg. E2-1, KAIST 291 Daehak-ro, Yuseong-gu, Daejeon 34141 [Map]