Machine Learning Blueprint Newsletter, Edition 25, 6/17/18

This newsletter is written and curated by Mike Tamir and Mike Mansour. 

June 17, 2018

Hi all,
Hope you enjoy this week's ML Blueprint. This week is brought to you by fastdata.io.

Spotlight Articles
Traditional approaches to predicting the outcomes of sporting events have statistically combined several different bookie-generated odds, but German researchers have taken a machine learning approach, training random forests on a large feature set. Bookies currently mark Germany to win the cup, while this algorithm picks Spain - though of course that prediction changes if one of those teams is upset or simply doesn't make the finals. The calculation is difficult due to the massive number of possible outcome configurations. Kaggle released an analysis of the data here, and the research paper describes the proposed method.
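As a rough illustration of the technique - not the researchers' actual model or feature set - here is a minimal sketch of training a random forest on synthetic match features and reading off a win probability. The feature names and data below are invented:

```python
# Hypothetical sketch: match-outcome prediction with a random forest.
# The paper's real feature set (FIFA rank, market value, odds, etc.)
# is far larger; these synthetic features are for illustration only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Each row is one match:
# columns = [rank_diff, market_value_diff, avg_bookie_odds_home]
X = rng.normal(size=(500, 3))
# Label: 1 if the home team won (synthetic rule plus noise)
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X, y)

# Estimate the home team's win probability for a new match
match = np.array([[1.2, 0.8, -0.3]])
p_win = model.predict_proba(match)[0, 1]
print(f"estimated home-win probability: {p_win:.2f}")
```

The ensemble's averaged vote doubles as a calibrated-ish probability, which is what makes tournament-level simulation of "outcome configurations" possible.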
Machine Learning Blueprint's Take
Just as a good trader does not reveal a successful trading strategy (or algorithm, for that matter), a good gambler likely does not reveal an accurate prediction algorithm if they want to retain the upper hand. In this case, there probably exist much more sophisticated models for predicting sporting-event outcomes that have not been shared. On another note, publicly releasing a prediction machine might now affect the payouts: what if the researchers chose a model whose predicted outcome would influence the global betting markets and payouts to their advantage? Just putting on the old tin foil hat...
Technical debt is a software engineering term for the costs, brittleness, and slowed innovation that accumulate in a system over time from the tradeoff between speed of execution and quality of engineering. The cost of debt compounds over time as well, resulting in expensive cleanups. Google takes a 'systems view' of productionized machine learning and shows that it is not immune to various forms of technical debt. Frequently, debt comes from hastily adding data features or model complexity for a small accuracy boost. Other times there may be hidden feedback loops or unintended consumers of a signal that then require more maintenance of an internal product. They share best practices for avoiding technical debt at the early stages of architectural planning and at systems crossroads.
Machine Learning Blueprint's Take
Given the lack of industry solutions 4 years on, this topic is worth a revisit: this classic paper points out a surprisingly large number of channels through which the various incarnations of technical debt can sneak into an ML system. As the paper's title suggests, it's far too easy to accumulate, and the effects of debt are subtle and difficult to detect. This may stem from the fact that data science is an inherently multidisciplinary field - engineers must be computer scientists, statisticians, and mathematicians, making it easy to gloss over systems engineering. One way of spreading systems-level thinking is to invest in more diverse teams, with engineers and data scientists working closely together.
Learning Machine Learning
Well-labeled, copious training data is hard to come by, frequently forcing engineers to employ more advanced techniques to model well. Bad labels frequently break these models, and the data-collection assumptions of academic datasets typically don't match a commercial application. For example, ImageNet is unsuited for drone-related tasks: all its shots are taken by humans at ground level, whereas a drone perceives from a bird's-eye view. This author shares more examples and lessons learned for improving a dataset, starting with simple manual investigation of a large sample and ending with investing in better processes for collecting and labeling data.
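The first lesson - manually investigating a large sample - can be sketched as a simple audit loop. The dataset and the review function below are hypothetical stand-ins; in practice the review step is a human looking at examples:

```python
# Minimal sketch of a label audit: sample labeled examples at random
# and tally suspected label errors. Names and data are invented, and
# looks_mislabeled() stands in for a human reviewer's judgment.
import random

random.seed(0)

# Hypothetical dataset of (example_id, label) pairs with ~5% bad labels
dataset = [(i, "cat" if i % 20 else "dog") for i in range(1000)]

def looks_mislabeled(example_id, label):
    # Stand-in for a human reviewer flagging a suspicious label
    return example_id % 20 == 0 and label == "dog"

sample = random.sample(dataset, 200)
flagged = [ex for ex in sample if looks_mislabeled(*ex)]
error_rate = len(flagged) / len(sample)
print(f"estimated label-error rate: {error_rate:.1%}")
```

Even a crude estimate like this tells you whether cleaning labels or collecting more data is the better investment before reaching for a more complex model.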
Machine Learning Blueprint's Take
A sticking point here is that data labeling is usually a human process, and that comes with hidden biases that affect the label distributions, or even the data collection itself. Even the wording of the labeling-task description will have effects on the output. The suggestions for improving training data and effectively collecting more are, well, manual and require more humans in the loop, but that cost is arguably lower than having to develop more complex models that demand deeper understanding and more computation. Plus, a dataset can be more of an asset than a model in a secondary market, since it can serve several potential model-training needs.
Constrained optimization is the problem of minimizing or maximizing some objective subject to constraints. Frequently, these constraints are business requirements, like keeping cost or labor to a minimum. StitchFix lays out a toy example, works through the math, and recommends some Python libraries that implement the required solvers.
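A minimal sketch of the idea, assuming SciPy's `linprog` (a commonly available solver, not necessarily one of the libraries StitchFix names): minimize a labor cost subject to a demand constraint. All numbers are invented:

```python
# Toy constrained optimization: minimize labor cost c.x subject to
# meeting demand, via SciPy's linear-programming solver.
from scipy.optimize import linprog

# Hourly cost of two worker types
c = [15.0, 25.0]

# Demand: together they must process >= 100 units.
# linprog expects A_ub @ x <= b_ub, so negate to express >=.
# Worker 1 handles 2 units/hour, worker 2 handles 5 units/hour.
A_ub = [[-2.0, -5.0]]
b_ub = [-100.0]

# Each worker type can contribute at most 40 hours
bounds = [(0, 40), (0, 40)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
print("hours:", res.x, "total cost:", res.fun)
```

Here worker 2 is cheaper per unit (25/5 = 5 vs. 15/2 = 7.5), so the solver assigns all the demand to worker 2: 20 hours at a total cost of 500.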
Most of the tips here might not apply to someone in a corporate setting, but rather to academics. The authors suggest finding someone to ask "dumb" questions of, share methods for finding research inspiration, and offer tricks to manage time and track progress.

Machine Learning News
Not another article about how AI is curing the next disease, but rather a highlight of all the ways St. Joseph's is integrating machine learning to optimize processes across its 51 different hospitals. They're able to predict and reduce appointment no-shows, recommend health-related content to patients based on their medical history, use propensity models to better target patients for services they're likely to acquire, and optimize delivery systems to minimize care disruption.
Machine Learning Blueprint's Take
AI for medical research is risky and its data is rather limited. Here, by contrast, there is an abundance of data, and the outcomes can move the needle on providing better care. Additionally, false positives and false negatives are far less costly in these applications, suggesting a higher chance of acceptance by healthcare leaders. Read deeper and there are a few opportunity areas for data scientists to tackle with statistics and software.
Dubbed "Project Debater", this new software makes an argument on a topic, gives rebuttals, and even forms closing statements, in what is a clear extension of the Jeopardy-playing machine with extra bells and whistles. Some are lambasting the system for not actually understanding anything, merely recombining previous argument elements and points pulled from Wikipedia.
Machine Learning Blueprint's Take
Aside from the glitzy IBM-style marketing demo of the system against human debaters, they cite some potential use cases for this technology like helping people make more rational decisions. On the flipside, it could also be trained to troll humans on internet forums - putting many out of work at the Russian Internet Research Agency.
Sponsored
TuSimple evaluated WekaIO Matrix™ against standard NAS solutions and legacy file systems, and found that Matrix delivered better scalability and performance. WekaIO leapfrogs legacy storage infrastructures and future-proofs datacenters by delivering the world's fastest parallel file system with the most flexible deployment options - on-premises, cloud, or cloud bursting. Matrix software is ideally suited for latency-sensitive business applications at scale such as AI and machine learning.
New chips with microelectronic synapses mimic the brain to achieve a neuron-like structure. The approach uses two kinds of "synapses": shorter-term ones for computation and longer-term ones for memory. The proposed benefit of mimicking the brain is mainly in energy consumption, since the human brain consumes remarkably little power for what it computes. Research paper here.
Interesting Research
Steganography is the art of concealing information, and may be used for exfiltrating data out of monitored environments or for creating covert communication channels. In this case, researchers use GANs to alter fonts ever so slightly to encode secret messages. A font manifold is learned across a few select fonts, and various points along it are each assigned to an ASCII character. Characters are then transformed by a GAN along this manifold, and the resulting covert text is printed out or converted to an image. To extract the hidden message, characters are bounded, extracted, and fed to a trained CNN that outputs the secret code.
Machine Learning Blueprint's Take
Apparently this method breaks down if the paper with the hidden message is crumpled, suggesting that this communication channel has low capacity and could be defeated with strategically placed noise (they do train the CNN with some Gaussian noise on the generated characters). There are several modifications that would make this approach much more robust to errors, like leveraging statistical properties in language to encode information, or even using multiple characters to transmit a single character.
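The last suggestion - using multiple characters to transmit a single hidden symbol - amounts to adding redundancy to the channel. A minimal sketch with a 3x repetition code (purely illustrative, not the paper's scheme):

```python
# Repetition-code sketch: carry each hidden symbol on several cover
# characters so an isolated recognition error can be outvoted.
from collections import Counter

REPEAT = 3  # each hidden symbol is carried by 3 cover characters

def encode(symbols):
    # Replicate each hidden symbol across REPEAT carrier slots
    return [s for s in symbols for _ in range(REPEAT)]

def decode(received):
    # Majority vote within each group of REPEAT slots
    out = []
    for i in range(0, len(received), REPEAT):
        group = received[i:i + REPEAT]
        out.append(Counter(group).most_common(1)[0][0])
    return out

channel = encode(list("SECRET"))
channel[4] = "X"  # a crumple-induced recognition error
print("".join(decode(channel)))  # majority vote recovers "SECRET"
```

Real designs would use proper error-correcting codes (e.g. Reed-Solomon) rather than bare repetition, trading channel capacity for robustness to exactly the crumpling-style noise the authors describe.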
