Saturday, May 19, 2018

On Monday night I saw Levi Joseph speak at the St. Louis .NET User Group about being a data scientist.

Levi is a data scientist at Sense Corp, specifically at its Austin office (well, Rollingwood, Texas), not its St. Louis office, and he was quick to mention that "data scientist" as a title may be something that comes and goes like "webmaster" as things evolve over time. Right now data science is in its infancy and is as wishy-washy in terms of definition as "webmaster" once was. Levi offered a few definitions that he liked, and among them was: data science is the generalized extraction of knowledge from data. One of his slides had a Venn diagram with three circles labeled Domain Expertise, Computing Skills, and Math & Statistical Knowledge, and in the middle, where all three circles overlapped a little bit, was Data Science.

The corner of data science where Levi spends most of his time is preventive maintenance. If you've collected a bunch of data on when airplane parts fail, you can predict when airplane parts will fail and also predict RUL (remaining useful life) for parts in hours, along with full life, which is a different metric along the same lines, and be flagged if a part is so error-prone that you should abandon it and use a different part from a different vendor, etc. This creates a cost savings in the big picture. If you are doing this and your competitors are not, you have a leg up. Levi spends a lot of his time doing the ETL work to get his customers' existing collected data into shape, scaling numbers into a decimal range between zero and one and the like. Time of failure is the reciprocal of RUL.

SSDT (SQL Server Data Tools), which lets you model data and play nicely with Azure beyond just SSMS (SQL Server Management Studio), and also, yes, SSMS itself were name-dropped and recommended. Bespoke was name-dropped too. A glance at their website makes them look like consultants in this space to me; I don't know if they offer tools as well. Azure IoT is the best for managing IoT data right now.
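
The scaling into a zero-to-one range that Levi described is typically min-max normalization. Here's a minimal sketch in Python; the hours-until-failure numbers are invented for illustration, not from the talk:

```python
# Min-max normalization: squeeze a feature into the [0, 1] range,
# a common ETL step before feeding data to a model.
def min_max_scale(values):
    lo, hi = min(values), max(values)
    if hi == lo:  # avoid division by zero on a constant column
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

rul_hours = [120.0, 480.0, 960.0, 240.0]  # made-up remaining-useful-life figures
print(min_max_scale(rul_hours))  # every value now falls between 0 and 1
```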
RStudio is the IDE for R, which is the language Levi uses the most. DataRobot is the best tool for machine learning, the subfield of artificial intelligence in which a system improves its performance over time by learning from data. Microsoft Azure Machine Learning Studio is also legit, as is ML.NET from Microsoft. Beyond the name-dropping of tools, there was some name-dropping of statistics terms:

  • True Positive — We predicted it. It happened.
  • True Negative — We predicted it would not happen and it didn't.
  • False Positive — We predicted it, but it didn't happen.
  • False Negative — We predicted it would not happen, yet it happened nonetheless.
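
These four outcomes are just tallies over predicted-versus-actual pairs. A quick sketch, with made-up labels (1 means "part will fail," 0 means "part won't fail"):

```python
# Tally the four confusion-matrix outcomes from paired predictions and actuals.
def confusion_counts(predicted, actual):
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    tn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 0)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    return tp, tn, fp, fn

pred = [1, 0, 1, 1, 0, 0]  # invented example data
act  = [1, 0, 0, 1, 1, 0]
print(confusion_counts(pred, act))  # (2, 2, 1, 1)
```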

Classification has to do with separating a dataset into categories (often just two) and regression has to do with finding the direction things are trending. Overfitting and underfitting are important terms in regression. In charting how an airplane part may decay over time on its way to the end of RUL, and complete life beyond that, there may be some checkpoints along the way. I dunno, maybe warranty expiration typically comes before the end of RUL. Anyhow, in writing an algorithm that perfectly hits all of the points in time predictively, one is guilty of overfitting. You are going to be writing some silly rules to turn some tight corners, and you'll introduce noise into an algorithm that isn't very good at palm-reading the future. You need to come near some of the plot points, not hit them. Loosen up your rules to where they make sense. If you loosen up too much, though, you are guilty of underfitting, the act of making an algorithm too vague to be predictive and helpful. Underfitting has high bias and overfitting has high variance. Somewhere in the middle is the "just right" Goldilocks porridge.

A distinction was made between Weak AI and Strong AI in the artificial intelligence arena. Weak AI can make a narrow set of decisions, like a chatbot. Strong AI is the Skynet-from-Terminator stuff, and while there are plenty of examples of Weak AI actively a part of the tech sphere today, there is no Skynet, and Levi expressed pessimism about humanity ever getting there. Some of the most impressive work being done in AI has a team of PhDs at Google putting the "um" pause into computer-generated speech to make it seem more lifelike. We are a long way from falling in love with Scarjo's voice in that "Her" movie I never saw.

IBM and Bell Labs chips have been doing the "computing skills" part of the Venn diagram for years, but now the laptop of Average Joe is starting to be able to compete. IBM is kind of in decay in the data science space as there is just constant firefighting with Watson.
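
The overfitting/underfitting trade-off is easy to see with polynomial fits. A sketch with NumPy (the data is invented, not from the talk): a high-degree polynomial hugs every noisy training point (low training error, high variance), while a flat line is too vague to say anything (high bias).

```python
import numpy as np

# Noisy points along a line: the "true" trend is linear.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 15)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, x.size)

def train_error(degree):
    # Fit a polynomial of the given degree, then measure mean squared
    # error on the SAME points it was trained on.
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    return float(np.mean(residuals ** 2))

# Degree 0 underfits, degree 9 overfits; training error alone can't
# tell you that -- it only ever goes down as the model gets wigglier.
print(train_error(0), train_error(1), train_error(9))
```

That training error shrinks monotonically as degree grows is exactly why you judge a model on held-out data, not on the points it memorized.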
Microsoft is doing better and better in the space. The SAS (Statistical Analysis System) company that makes the SAS language is a major player too, but SAS the language is losing market share to Python and R and going the way of FORTRAN (the name means Formula Translation). R is doing better than Python presently, but Python looks poised to overtake it.

A Matthew Bowers, a dBase III (an early database for personal computers, or microcomputers if you will) programmer, gave the opening lightning talk before Levi Joseph spoke. He spoke to the Uber Effect, in which upstart businesses disrupt established ones, such as Uber killing taxicabs, Amazon killing Barnes and Noble (or at least Borders Group), and Netflix killing Blockbuster, and posed the question "Are you going to be a disruptor or the disrupted?" (I'm paraphrasing a hint.) Good CX (customer experience) depends on understanding your customers, and by 2020 there should be fifty billion devices cross-talking with the internet, so if you don't understand what is behind your revenue stream as a business owner, well...
