Machine Learning Blueprint Newsletter, Edition 24, 6/10/18

This newsletter is written and curated by Mike Tamir and Mike Mansour. 

June 10, 2018

Hi all,
Hope you enjoy this week's ML Blueprint. This week is brought to you by fastdata.io.

Spotlight Articles
Expanding on the Synthesizing Obama DeepFake generation from about a year ago, new advancements to be presented at SIGGRAPH 2018 this autumn take it to a whole new level. Now, given input video of an actor, they can produce corresponding output video with minimal generation artifacts (see the gif below; note the image warping around the head). Aside from the obvious nefarious uses of this technology (DeepFakes & DeepPorno), one beneficial use may be creating really good dubs of video into other languages.
Machine Learning Blueprint's Take
Since this was published in the ACM, we can expect a thought exercise on both the positive and negative implications of this research, along with some well-laid-out areas of future research covering detection strategies. See related work on detecting DeepFakes.
[gif: generated output video; note the warping around the head]
To clear the waters, Google publishes seven objectives for how its ML applications will operate, and outlines a few red-line territories it will not develop applications in, like weapons, surveillance, or abuses of international human rights (all with some caveats). Here are their “guidelines”:
  • Be socially beneficial.
  • Avoid creating or reinforcing unfair bias.
  • Be built and tested for safety.
  • Be accountable to people.
  • Incorporate privacy design principles.
  • Uphold high standards of scientific excellence.
  • Be made available for uses that accord with these principles.
Adversarial attacks have recently been generating buzz in the ML space, but other attack vectors against deployed models remain that should encourage engineers to develop defenses. One attack vector, data poisoning, occurs when a malicious attacker tries to change a trend in your training dataset via feedback mechanisms, for example by reporting not-spam as spam in an effort to throw off the classifier. Attackers may also try to steal your model, or at least learn its decision boundaries, which in effect steals the intellectual property and data-collection effort behind it. This article responsibly provides a number of defense strategies for all three attack vectors.
Machine Learning Blueprint's Take
Security is most likely not at the top of a machine learning engineer's mind, but as ML gets more integrated into products whose other developers may not appreciate the implications above, understanding the risks and defenses will keep the product performing as advertised. Unlike traditional security, it may be hard to determine when the algorithm is under attack, as the change could be a valid shift in the data distribution. A useful recommendation suggested here is to implement more anomaly detection at interaction points to determine when user behavior is awry, and perhaps the data pool has become tainted.
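To make the recommendation concrete, here is a minimal sketch of anomaly detection on a feedback signal, flagging days whose "reported as spam" rate deviates sharply from recent history. The function name, thresholds, and numbers are illustrative, not from the article.

```python
import numpy as np

def flag_anomalous_windows(daily_report_rates, window=7, k=3.0):
    """Flag days whose 'reported as spam' rate deviates more than k
    standard deviations from the trailing window's mean -- a crude
    signal that the feedback pool may be getting poisoned."""
    rates = np.asarray(daily_report_rates, dtype=float)
    flags = []
    for i in range(window, len(rates)):
        hist = rates[i - window:i]
        mu, sigma = hist.mean(), hist.std()
        if sigma > 0 and abs(rates[i] - mu) > k * sigma:
            flags.append(i)
    return flags

# Stable feedback rates, then a sudden coordinated mislabeling spike.
normal = [0.010, 0.012, 0.011, 0.009, 0.010, 0.011, 0.010]
poisoned = normal + [0.011, 0.095]   # day 8 spikes roughly 9x
print(flag_anomalous_windows(poisoned))  # → [8]
```

A real deployment would monitor many such signals (label rates, feature distributions, per-user activity), but the shape of the check is the same.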
A message from our sponsors...
WekaIO Matrix™, the world’s fastest distributed file system for machine learning workloads, has been named a Cool Vendor by Gartner, Inc., in the Gartner Cool Vendors in Storage Technologies, 2018 report.
Matrix is the first and only NVMe-native shared and distributed file system written to support new high-performance workloads in machine learning. The demands of 2018 workloads cannot be met with a thirty-year-old protocol such as NFS, which was designed when networks were slow relative to the storage media.

Learning Machine Learning
t-distributed Stochastic Neighbor Embedding (t-SNE) is a method for EDA of high-dimensional data, but it previously came with a high computational expense. A new optimization is presented, leveraging GPUs via WebGL to speed up the N-body objective function, in which attractions and repulsions are calculated to generate the small cluster neighborhoods. It is now fast enough to compute t-SNE embeddings in real time in the browser, whereas MNIST previously took 15 minutes in a C++ implementation. Technical details on the optimizations here.
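The attraction/repulsion structure the optimization targets is visible in the t-SNE gradient itself. A minimal NumPy sketch of one gradient evaluation (the function name and small example are mine, not from the linked implementation):

```python
import numpy as np

def tsne_gradient(Y, P):
    """One gradient evaluation of the t-SNE objective, written to expose
    the N-body structure the GPU version accelerates: every point feels
    an attraction (P term) and a repulsion (Q term) from every other
    point. Y: (n, 2) embedding, P: (n, n) symmetric input affinities."""
    diff = Y[:, None, :] - Y[None, :, :]          # pairwise y_i - y_j
    dist2 = (diff ** 2).sum(-1)                   # squared distances
    W = 1.0 / (1.0 + dist2)                       # Student-t kernel
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()                               # low-dim affinities
    # (P - Q) mixes attraction (P) and repulsion (Q) for each pair
    grad = 4.0 * ((P - Q) * W)[:, :, None] * diff
    return grad.sum(axis=1)                       # (n, 2)

# Two points whose input affinity exceeds their embedding affinity:
# the gradient pulls them together.
Y = np.array([[0.0, 0.0], [1.0, 0.0]])
P = np.array([[0.0, 0.9], [0.9, 0.0]])
print(tsne_gradient(Y, P))
```

The O(n²) pairwise terms here are exactly what the WebGL implementation reformulates to run fast on a GPU.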
A series of summaries tracking state of the art in object detection as it evolved, starting with two-stage detection methods like R-CNN, and ending with single-stage methods like YOLO/SSD. It’s useful for getting a quick scope of the territory and choosing which methodology to explore for your use case.
An alternative, and debatably better, approach to determining model capacity than VC dimension, Rademacher complexity was recently used in the LeCun paper explaining why DNNs don't lose generalizability with large parameter spaces. The post here steps through the equation and relates it to intuitive research on human learning.
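For reference, the standard definition the post steps through is the empirical Rademacher complexity of a hypothesis class $\mathcal{H}$ on a sample $S = (x_1, \dots, x_m)$:

```latex
\hat{\mathfrak{R}}_S(\mathcal{H})
  = \mathbb{E}_{\boldsymbol{\sigma}}
    \left[ \sup_{h \in \mathcal{H}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i \, h(x_i) \right],
\qquad \sigma_i \sim \mathrm{Uniform}\{-1, +1\}
```

where the $\sigma_i$ are independent random signs. Intuitively, it measures how well the class can correlate with pure noise on the sample, which is why it can track generalization more tightly than the worst-case, data-independent VC dimension.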
Learn about CuPy to bring NumPy arrays to the GPU via CUDA, Sparse for, well ... sparse arrays, and Dask for extending the NumPy API to parallel execution on multi-core workstations or distributed clusters.
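These libraries share one idea: keep the NumPy API, swap the execution backend. A minimal sketch of the chunked map/reduce pattern that Dask generalizes, using only NumPy and the standard library (the function name and chunk count are illustrative):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def chunked_sum_of_squares(x, n_chunks=4):
    """Split an array into chunks, reduce each in a worker thread,
    then combine the partials -- the shape behind dask.array."""
    chunks = np.array_split(x, n_chunks)
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda c: float((c ** 2).sum()), chunks))
    return sum(partials)

x = np.arange(8, dtype=float)           # 0..7
print(chunked_sum_of_squares(x))        # → 140.0
```

Dask builds a task graph of such chunked operations lazily and schedules them across cores or a cluster, while keeping the familiar array interface.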

Machine Learning News
According to emails between Cruise and the City of San Francisco, accessed through the Freedom of Information Act, Cruise is pressuring the city to aid in some crucial data collection as it pushes toward a release. Mainly, they want the SFFD to drive around some autonomous units with lights flashing to cover potentially weak spots in their algorithm; however, the administration refuses to proceed, citing this as a poor use of resources. Nevertheless, many of the details within hint at a nearing release.
Machine Learning Blueprint's Take
Correct operation around emergency vehicles and situations is important for releasing an autonomous unit on the road, for both safety and reputation. Interfering with an emergency would be disastrous and would risk a moratorium on autonomous vehicles. In this case, the number of training examples collected naturally is, reasonably enough, likely to be small; generating some situations in this manner seems sensible and should be in the interest of emergency responders as well.
Following from Google's AutoML, this tool uses reinforcement learning to find optimal image-transformation policies for augmenting an image training dataset along a number of different dimensions. Each image transformation depends on the entire training dataset and the image itself, so no two transformations will be the same. The research quotes a 0.83% improvement in error on CIFAR10.
Machine Learning Blueprint's Take
Collecting training data for custom purposes is expensive, and the lack of insane-scale data requires more complex methods for learning the various invariances in the data. This may allow for simpler models, which in turn are less computationally expensive at inference time in deployment.
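The policies searched over are lists of (operation, probability, magnitude) triples. A toy, hand-written policy applied to image arrays; the operations and magnitudes here are illustrative stand-ins, not the learned policy from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def translate_x(img, mag):
    """Shift the image horizontally by `mag` pixels (wrapping)."""
    return np.roll(img, mag, axis=1)

def flip_lr(img, _mag):
    """Mirror the image left-right; magnitude is unused."""
    return img[:, ::-1]

# Each entry: (operation, probability of applying it, magnitude).
POLICY = [(translate_x, 0.8, 3), (flip_lr, 0.5, 0)]

def apply_policy(img, policy=POLICY):
    """Stochastically apply each op in the policy, in order."""
    for op, prob, mag in policy:
        if rng.random() < prob:
            img = op(img, mag)
    return img

img = np.arange(16).reshape(4, 4)
aug = apply_policy(img)
assert aug.shape == img.shape   # augmentations preserve image size
```

AutoAugment's contribution is using reinforcement learning to *search* for the triples rather than hand-tuning them as done here.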
A group of researchers trained an image-captioning algorithm on the most disturbing material found in the darkest corners of Reddit, like /r/gore and /r/watchpeopledie, and then had it caption the famous inkblot tests used for detecting underlying thought disorders. The results are, well, unsettling.
Machine Learning Blueprint's Take
The results may be unsettling, but they are not surprising. Of course the algorithm will come up with disturbing captions if that is all it is trained on, and inkblots themselves are amorphous enough that almost anything ascribed to them would not be incorrect. A more meaningful test would be to compare the generated captions with those that a genuinely disturbed person would have assigned.
Berkeley AI Research collaborated with Nexar to release a 100K-sample dataset of 40-second videos at 720p and 30fps, sourced from around the United States, making it the largest and most diverse public driving dataset to date. It comes with a suite of annotations, including image tagging, road-object bounding boxes, drivable areas, lane markings, and full-frame instance segmentation. After you've toyed with the data sufficiently, you can enter any of the three challenges they're hosting at the CVPR 2018 Workshop on Autonomous Driving based on this data.
Some minor updates were added, like the ability to subclass Models and to process symbolic tensors. Overall, the Keras engine is more modular, and this appears to have been a technical-debt-cleansing release cycle for the team.
Databricks releases the alpha version of its ML model lifecycle management tool, much in the vein of other enterprise versions like FBLearner Flow (Facebook) and TFX (Google), except that this one is open source. MLflow will help you track your experiments and code changes, provide reproducible training setups, and aid in managing and deploying models.
Interesting Research
Many defining characteristics of human intelligence, which developed under much different pressures, remain out of reach for current advances in using deep learning for AI. In this paper the authors argue that combinatorial generalization must be a top priority for AI to achieve human-like abilities, and that structured representations and computations are key to realizing this objective. They explore how using relational inductive biases within deep learning architectures can facilitate learning about entities, relations, and rules for composing them, suggesting that graph networks can lay the foundation for more sophisticated, interpretable, and flexible patterns of reasoning.
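The core computation these graph networks share is neighborhood message passing: each entity aggregates messages from its relations and updates its state. A minimal NumPy sketch (the function name, weights, and toy graph are illustrative, not the paper's formulation):

```python
import numpy as np

def message_passing_step(node_feats, edges, W_msg, W_upd):
    """One round of message passing over a directed graph.
    node_feats: (n, d) entity states; edges: list of (src, dst);
    W_msg / W_upd: (d, d) illustrative parameter matrices."""
    n, d = node_feats.shape
    agg = np.zeros((n, d))
    for src, dst in edges:                      # messages flow src -> dst
        agg[dst] += node_feats[src] @ W_msg     # relation-aware message
    return np.tanh(node_feats @ W_upd + agg)    # update each entity

# Tiny 3-node chain: 0 -> 1 -> 2
feats = np.eye(3)
edges = [(0, 1), (1, 2)]
W = np.eye(3)
out = message_passing_step(feats, edges, W, W)
print(out.shape)  # → (3, 3)
```

Because the same message and update functions are reused across every edge and node, the learned rules generalize to graphs of different shapes and sizes, which is the combinatorial generalization the authors argue for.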
This article addresses the open question of why neural networks generalize better with over-parametrization. The researchers suggest a novel complexity measure based on unit-wise capacities. This new measure correlates with the behavior of test error with increasing network sizes, and could potentially explain the improvement in generalization.

