This presentation was recorded at YOW! 2019. #GOTOcon #YOW
https://yowcon.com

Huon Wilson - Software Engineer at CSIRO's Data61

RESOURCES
https://www.linkedin.com/in/huon-wilson
https://twitter.com/huon_w
https://github.com/huonw
https://huonw.github.io

ABSTRACT
Real world #data is rarely clean: there are often corrupted and duplicate records, and even corrupted records that are duplicates! One step in #DataCleaning is #EntityResolution: connecting all of the duplicate records into the single underlying entity that they represent. This talk will describe how we approach entity resolution, and look at some of the challenges, solutions and lessons learnt when doing entity resolution on top of #ApacheSpark, and scaling it to process billions of records. [...]