Why? Recently, I am learning about Apache Spark distributed computing platform that executes Machine Learning tasks for Big-Data system. Spark has 2 languages used, scala and python (pypark), so this article refers to Pyspark, maybe I will write an article about scala (not today :heart_eyes:). The full code can be found at the repo Github. Vietnamese version can be read at Vie. The problem is that I have a super large data file about dogs and cats that need to be filtered into two separate pla...