Who am I?

About

Søren Kierkegaard - «Life can only be understood backwards; but it must be lived forwards.» I trust in my gut, destiny, life, karma, and whatever else in doing good; I believe that the dots will connect down the road.

Biography

DB Tsai (蔡東邦) is a Senior Engineering Manager at Apple supporting the Spark, Flink, Hadoop, HBase, and Data Security teams. He is an Apache Spark Project Management Committee (PMC) Member and Committer, and he enjoys building teams with great cultures focused on large-scale distributed data infrastructure. Before his transition to a leadership role, he implemented several algorithms in the Apache Spark project, including Linear Regression and Binary/Multinomial Logistic Regression with Elastic-Net (L1/L2) regularization using the L-BFGS/OWL-QN optimizers.

Prior to joining Apple, DB was a Lead Machine Learning Engineer at Netflix working on Personalized Recommendation Algorithms and Machine Learning infrastructure, where he developed large-scale distributed learning algorithms and contributed them back to the open source Apache Spark project.

He is a big fan of the Scala programming language, and has been using it together with Apache Spark to build scalable and robust cloud-driven recommendation systems and machine learning applications.

DB was a Ph.D. candidate in Applied Physics at Stanford University. He holds a Master’s degree in Electrical Engineering from Stanford University, as well as a Master’s degree in Physics from National Taiwan University. He received his Bachelor’s degree in Physics from National Cheng Kung University.

Resume [PDF]

Summary

I specialize in big data machine learning with a strong background in theoretical statistics and mathematics.
I’ve implemented various distributed machine learning algorithms using Hadoop and Spark for large-scale data processing, and contributed them back to open source communities.
I’ve been actively involved in open source Apache Spark development as a committer.

Specialties

Distributed Machine Learning and Data Mining.
Apache Hadoop and Spark stack.
Computer languages such as Scala, Java, Python, C, and C++.
Mathematical scripting languages (Matlab and R).
Parallel Computing and Big Data Processing using MapReduce and MPI.

Experience

Apache Spark — A fast and general engine for large-scale data processing
- Committer from May 2015 to current
- Project Management Committee (PMC) Member from June 2017 to current
- My contributions, [GitHub]
- Implemented new features such as L-BFGS and Multinomial/Binomial Logistic Regression.
- Reviewed code for other contributors and guided them until their code was merged.
- Fixed various bugs, wrote documentation, and optimized performance.
Netflix, Los Gatos, CA — A Leading Provider of Internet Streaming Media Available Worldwide
- Senior Research Engineer from Apr. 2015 to current
- Worked on personalized recommendation algorithms and machine learning infrastructure
- Architected and implemented the Distributed Time Travel Machine for Feature Generation using Apache Spark, which lets our researchers quickly try ideas for new features on historical data so that running offline experiments and transitioning to online A/B tests is seamless. This framework reduces the time to bring an offline experiment to an online A/B test from months to weeks, and significantly reduces the offline/online discrepancy by sharing the feature generation logic between the offline and online paths. U.S. Patent filed February 2016. Patent Pending.
- Implemented a categorical feature learner in Netflix’s in-house GBDT (Gradient Boosting Decision Tree) implementation as part of the global algorithm effort to incorporate the country and language categorical signals.
- Implemented Weighted Logistic Regression in open source Apache Spark ML, which is used in Netflix’s personalized page algorithms for constructing the rows on the homepage.
- Worked closely with the Apache Spark community to merge our changes, and implemented new features for our needs.
SF Machine Learning Meetup, CA — People with Shared Interests of Machine Learning and Big Data
- Co-Organizer from Jun. 2013 to July 2015
- [Meetup Page]
- The community had more than 2,700 machine learning enthusiasts.
- Held the meetup monthly and invited well-known speakers from industry and academia to give talks.
Alpine Data Labs, San Francisco, CA — The Leader in Data Science for Big Data
- Machine Learning Lead from Aug. 2014 to April 2015
- Machine Learning Engineer from Apr. 2013 to Aug. 2014
- Developed scalable Multinomial Logistic Regression and Linear Regression with elastic-net regularization, which linearly combines the L1 and L2 penalties, in Apache Spark. Implemented OWL-QN for L1/L2 regularized optimization.
- Developed scalable algorithms such as Decision Tree, Variable Selection based on Information Gain, exact one-pass Linear Regression with L2 penalty, and PCA in Hadoop MapReduce.
- Migrated the build infrastructure from Ant to SBT for better third-party library dependency management using the Maven central repository, better integration with Jenkins for continuous integration, a better development and debugging experience for developers, and easier release builds.
KeeKa, StartX 2012 Summer, Stanford, CA — A Social Network Connecting People through Fashion
- Co-founder and CTO from Jan. 2012 to Mar. 2013
- Planned the strategies and invented a disruptive product.
- Designed the architecture of the website, including deployment, front-end, and back-end systems.
- Coordinated the designer, front-end team, and back-end team, and performed code reviews to ensure reliability, effectiveness, progress, and productivity.

Education

Stanford University, California, U.S.A.
- ABD in Applied Physics Ph.D. program from Sept. 2010 to June 2012
- M.S. in Electrical Engineering from Sept. 2010 to June 2012
National Taiwan University, Taipei, Taiwan
- M.S. in Physics from Sept. 2006 to July 2008
National Cheng Kung University, Tainan, Taiwan
- B.S. in Physics from Sept. 2002 to June 2006