Table of Contents
- Robustness and generalizability
- Directions and applications
This year, I had a chance to attend NeurIPS, the most prominent conference in artificial intelligence and machine learning (AI/ML), to present a workshop paper. I’ve spent the past couple of years working on a combination of AI research in various subfields and tech startups and so have been following the evolution of AI with interest. This conference, bringing together as it does some of the best researchers and practitioners in the field, was a particularly good vantage point to gauge the state of, and changes in, how people are thinking about and using AI. Here, I’ve collected some of my impressions in the hopes that they might be useful to others. If you’re curious about other people’s perspectives, Andrey Kurenkov collected some links to various talks and key trends in his recent post, which is also worth a look.
The most overarching theme I noticed at NeurIPS was the maturation of deep learning as a set of techniques. Since AlexNet won the ImageNet challenge resoundingly in 2012 by applying deep learning to a contest previously dominated by classical computer vision, deep learning has attracted a very large share of the attention within the field of AI/ML. Since then, the efforts of countless researchers developing deep learning and applying it to various problems have accomplished things like beating humans at Go, training robotic hands to solve Rubik’s cubes, and transcribing speech with unprecedented accuracy. Successes like these have generated excitement both within the AI community and elsewhere, with the mainstream impression tending towards an overestimate of what AI can actually do, fueled by the more narrowly circumscribed successes of new, largely deep-learning powered, methods. (Gary Marcus has a great recent essay talking about this in more detail.)
However, a perspective that I find more useful than “the robots are coming” is the one I heard from Michael I. Jordan when he came to Stanford to give a talk in which he described modern machine learning as the emerging field of engineering which deals with data. Consistent with this perspective, I saw a number of lines of inquiry at NeurIPS which are developing the field into more nuanced directions than “Got a prediction problem? Throw a deep net at it.” I’ll break down my impressions into three general areas: making models more robust and generalizable for the real world, making models more efficient, and interesting and emerging applications. While I don’t claim that my impressions are a representative sample of the field as a whole, I hope they will prove useful nonetheless.
Robustness and generalizability
One prominent category of work that I saw at NeurIPS addressed real-world requirements for successfully deploying models, beyond just high test-set accuracy. While a canonical case of a successful deep learning model, like an image classifier trained on the ImageNet dataset, is successful within its own domain, the real world in which models must be grounded and deployed is complex and ambiguous in ways which models must address if they are to be useful in practice.
One of these complexities is calibration: the ability of a model to estimate the confidence with which it makes predictions. For many real-world tasks, it’s necessary not only to have an argmax prediction, but to know how likely that prediction is to be accurate, so as to inform the weight given to that prediction in subsequent decision-making. A number of papers at NeurIPS addressed better approaches to this complexity.
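As a rough illustration of what "calibration" means in practice, one commonly used summary is the expected calibration error (ECE): bin predictions by confidence, then compare each bin's average confidence to its actual accuracy. The sketch below is a minimal version under my own assumptions (equal-width bins, made-up toy numbers); it is not taken from any particular NeurIPS paper.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |accuracy - confidence| over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece


# Toy data (hypothetical): the model says "95% sure" but is right only half the time.
confidences = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
correct = [True, True, False, False, True, False]
ece = expected_calibration_error(confidences, correct)
print(f"ECE = {ece:.3f}")
```

A perfectly calibrated model would score 0; the overconfident toy model above scores much worse, even though an accuracy number alone would not reveal the problem.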
Another complexity is ensuring that models are assigning appropriate importance to features which are semantically meaningful and generalizable, which in one way or another includes representation learning, interpretability and adversarial examples. A story I heard that illustrates the motivation for this line of research had its origins in a hospital, which had created a dataset of (if I remember correctly) chest X-ray images with associated labels of which patients had pneumonia and which did not. When researchers trained a model to predict the pneumonia labels, its out-of-sample performance was excellent. However, further digging revealed that in that hospital, patients likely to have pneumonia were sent to the “high-priority” X-ray machine, and lower-priority patients were sent to another machine entirely. It also emerged that the machines left characteristic visual signatures on the scans they generated and that the model had learned to use those signatures as the primary feature for its predictions, leading to predictions that were not based off of anything semantically relevant to pneumonia status and which would neither yield incremental useful information in the original hospital nor generalize in any way to other hospitals and machines.
This story is an example of a “clever Hans” moment, in which a model “cheats” by finding a quirk of the dataset it is trained on without learning anything meaningful and generalizable about the underlying task. I had a great conversation about this with Klaus-Robert Müller, whose paper on the phenomenon is well worth a read. I saw a number of other papers at NeurIPS dealing with interpretability of models, as well as representation learning, the related study of how models represent data. A notable subset of this work was in disentangled representations, an approach which aims to induce models to learn representations of data which are composed of meaningfully and/or usefully factorized components. An example would be a generative model of human faces which learns latent dimensions corresponding to hair color, emotion, etc., thus allowing better interpretability and control of the task.
A final direction attracting significant attention in the “what models learn” category was that of adversarial examples: data points which have semantically meaningful features corresponding to one category, but less semantically meaningful features which bias a model’s prediction in a different direction – for example, a photo that looks like a panda bear to humans but which contains noise that makes a model predict it to be a tree. Recent work in adversarial training has made progress in making models more resilient to such adversarial examples, and there were a number of papers at NeurIPS in this vein. I also had a very interesting conversation with Dimitris Tsipras, a coauthor on this paper, whose results suggest that image classifiers may use some less-robust features for classification, which can be perturbed to generate adversarial examples without modifying the more robust features which humans primarily focus on. This is an emerging area of investigation and the literature is worth a closer look.
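To make the idea of an adversarial example concrete, here is a minimal sketch of the fast gradient sign method (FGSM), one well-known way to construct such perturbations. It is applied to a tiny logistic-regression "classifier" rather than an image model, and the weights, input, and epsilon are made-up values chosen purely for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def fgsm(weights, bias, x, label, epsilon):
    """Shift every input feature by +/- epsilon in the direction that increases the loss."""
    p = predict(weights, bias, x)
    # For cross-entropy loss, the gradient with respect to input x_i is (p - label) * w_i.
    grad = [(p - label) * w for w in weights]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]


# Hypothetical model and input, chosen only to show the effect.
weights = [2.0, -3.0, 1.0]
bias = 0.0
x = [0.3, 0.1, 0.2]

x_adv = fgsm(weights, bias, x, label=1, epsilon=0.15)
print("clean prediction:      ", round(predict(weights, bias, x), 3))
print("adversarial prediction:", round(predict(weights, bias, x_adv), 3))
```

No feature moves by more than epsilon, yet the predicted class flips. Adversarial training, roughly speaking, adds examples like `x_adv` back into the training set with their correct labels.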
All in all, it appears that the community is spending considerable effort in making models more robust and generalizable for use in the real world, and I’m excited to see what further fruit this bears.
As the power and applicability of deep learning grows, we are seeing a transition of the field from the 0-to-1 phase, in which the most important results have to do with what is or is not possible at all, to a 1-to-n phase, in which tuning and optimizing the techniques previously found to be useful becomes more important. And just as the deep learning revolution had its underlying roots in the greater availability of compute and data, so too are the most prominent directions in this area which I saw at NeurIPS concerned with improving the data-efficiency and the computational efficiency of models.
Ultimately, deep learning depends on large amounts of data to be useful, but collecting this data and labeling it (for supervised approaches) are typically the most expensive and difficult stages of applying deep learning to a problem. A number of papers at NeurIPS had to do with reducing the severity of this issue. Many had to do with self-supervised learning, in which a model is trained to represent the underlying structure of a dataset by using implicit rather than explicit labels, e.g. predicting pixels of an image from neighboring pixels or predicting words in text from adjacent words. Another approach which a number of papers dealt with is semi-supervised learning, where models are trained on a combination of labeled and unlabeled data. And finally, weakly supervised learning has to do with learning models from imperfect labels, which are cheaper and easier to collect than perfect or almost perfect ones. Chris Ré’s group at Stanford, with their Snorkel project, are prominent in this area, and had at least one paper on weakly supervised learning at NeurIPS this year. This also falls under the “systems for ML” category, mentioned in the next section.
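As a rough sketch of the weak-supervision idea described above, the snippet below combines a few noisy, hand-written heuristics ("labeling functions") by simple majority vote. This is not Snorkel's API: as I understand it, Snorkel learns a model of how accurate each labeling function is, whereas plain voting is the simplest possible stand-in. The heuristics and example texts are hypothetical.

```python
from collections import Counter

ABSTAIN = None
SPAM, NOT_SPAM = 1, 0


# Each labeling function is a cheap, imperfect heuristic that may abstain.
def lf_has_link(text):
    return SPAM if "http" in text else ABSTAIN


def lf_all_caps(text):
    return SPAM if text.isupper() else ABSTAIN


def lf_greeting(text):
    return NOT_SPAM if text.lower().startswith(("hi", "hello")) else ABSTAIN


def weak_label(text, labeling_functions):
    """Majority vote over non-abstaining labeling functions; ties and no votes abstain."""
    votes = [v for v in (lf(text) for lf in labeling_functions) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]


lfs = [lf_has_link, lf_all_caps, lf_greeting]
for text in ["WIN MONEY NOW", "Hello, lunch tomorrow?", "See attached report"]:
    print(repr(text), "->", weak_label(text, lfs))
```

The resulting labels are noisy and incomplete, but they cost almost nothing to produce, and a downstream model can then be trained on them.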
Another prominent direction having to do with data efficiency (and also connected to representation learning) is that of meta/transfer/multi-task learning. Each of these approaches seeks to have models efficiently learn representations which are useful across tasks, thereby increasing the speed and data-efficiency with which new tasks can be tackled, up to and including one- or even zero-shot learning (learning a new task from a single example, or no examples at all). One interesting paper among many on these topics was this one, which introduces an approach to trading off regularization on cross-task vs. task-specific learning in the meta-learning setting.
Another direction in data efficiency which I noticed prominently at NeurIPS had to do with shaping the space within which models learn to better reflect the structure of the world within which they operate. This can broadly be thought of as “stronger priors” (although it seems the term “priors” itself is being used less frequently). Essentially, by constraining learning with some prior knowledge of how the world works, data can be used for learning more efficiently within this smaller space of possibilities. In this vein, I saw a couple of papers (here and here) improving models’ abilities to learn representations of the 3D world through approaches informed by the geometric structure of the world. I also saw a couple of papers (here and here, both from folks at Stanford) which use natural language to ground their representations of what they learn. This is an intriguing approach because we ourselves use natural language to ground and communicate our perception of the world, and forcing models to learn representations mediated by our languages in a sense imposes real-world priors upon the models. A final paper I’d mention in the category of priors as well is this one, which showed surprisingly good performance on MNIST of networks “trained” by architecture search alone – while this may not be immediately applicable, it is suggestive of the degree to which picking network architecture carefully (i.e. in a way that reflects the structure of a task) can make the learning process faster and cheaper.
One final direction relevant to data efficiency is that of privacy-aware learning. In some cases (and likely more to come in the future), data availability is bottlenecked by privacy constraints. A number of papers I saw, including many in the area of federated learning, dealt with how to learn from large amounts of data without compromising the privacy of the people or organizations from which the data originated.
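The core mechanic of federated learning can be sketched with federated averaging: each client trains on its own data locally, and only model parameters (never raw data) are sent back to be averaged. The example below is a toy under my own assumptions: a one-parameter linear model, two hypothetical clients, and no additional protections such as secure aggregation or differential privacy, which a real deployment would likely need.

```python
def local_update(w, data, lr=0.1, steps=20):
    """Run a few gradient-descent steps for y ~ w * x on one client's private data."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(client_weights, client_sizes):
    """Average client parameters, weighted by how many examples each client holds."""
    total = sum(client_sizes)
    return sum(w * n for w, n in zip(client_weights, client_sizes)) / total


# Hypothetical private datasets; both happen to follow y = 2x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(0.5, 1.0), (1.5, 3.0), (1.0, 2.0)],
]

global_w = 0.0
for _ in range(10):
    # Only the updated weights leave each client; the (x, y) pairs stay local.
    updates = [local_update(global_w, data) for data in clients]
    global_w = federated_average(updates, [len(data) for data in clients])

print(f"learned slope: {global_w:.4f}")
```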
As well as data efficiency, efficiency with regard to computational resources – i.e. compute and memory/storage – was also a prominent direction of many papers at NeurIPS. I saw a number of papers having to do with the compression of models and embeddings (the representation of the data used by models in certain settings). Shrinking models and embeddings/representations of data reduces both computational and storage requirements, allowing more “bang for the buck”. I also saw some interesting work in biologically-inspired neural networks, like this paper from Guru Raghavan at Caltech. One motivation in this area is that while there will be certain limits to how many matrix multiplications and additions can be performed per dollar/second on general-purpose hardware to push the capabilities of modern deep learning, it may be possible to use special-purpose hardware which more closely approximates the functions of biological neurons to achieve higher performance for certain tasks. I heard a combination of curiosity and skepticism around biologically-inspired approaches from fellow NeurIPS attendees: this is an area to watch for the 10+ year horizon.
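To give a feel for the model-compression work mentioned above, here is a minimal sketch of one simple technique: symmetric 8-bit quantization, which stores each weight as a small integer plus one shared scale factor. The weight values are made up, and the compression papers at the conference used a range of more sophisticated methods; this only shows the basic trade of precision for size.

```python
def quantize(weights, bits=8):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1          # 127 for 8 bits
    largest = max(abs(w) for w in weights)
    scale = largest / qmax if largest > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Recover approximate float weights from the integer codes."""
    return [q * scale for q in quantized]


# Hypothetical weights from some layer of a model.
weights = [0.82, -1.27, 0.003, 0.5]
codes, scale = quantize(weights)
restored = dequantize(codes, scale)

max_error = max(abs(w - r) for w, r in zip(weights, restored))
print("integer codes:", codes)
print(f"max reconstruction error: {max_error:.4f}")
```

Each weight now needs 8 bits instead of 32, at the cost of a rounding error of at most half the scale factor per weight.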
Directions and applications
Finally, while at NeurIPS I also found it very interesting to get a feel for the higher-level trends in various subfields of AI/ML and a feel for the different applications now possible, or becoming possible, thanks to recent advances in research. This section is more of a smorgasbord than a narrative; skip around as interest dictates.
Graph neural networks
One area which I should mention seeing a number of papers around is that of graph neural networks. These networks are able to more effectively represent data in settings with graph-like structure, but as I know very little about this direction personally, I’ll instead refer interested readers to the page of the NeurIPS workshop on graph representation learning as a starting point into the literature.
Reinforcement learning and contextual bandits
Another area in which I saw an absolutely tremendous amount of work was that of contextual bandits and reinforcement learning (RL). A few approaches which I saw a number of papers in were hierarchical RL (related to representation learning) and imitation learning (in a sense, setting priors for models through human demonstration). I also saw a number of papers dealing with long-horizon RL, in line with recent success in RL tasks requiring planning further into the future, e.g. the game Montezuma’s Revenge. A number of papers also had to do with transferring from simulation to the real world (sim2real), including OpenAI’s striking demonstration of teaching a robotic hand to solve a Rubik’s cube in the real world after training in-simulation. I also talked to Marvin Zhang from Berkeley about a paper he coauthored in which a robot was trained on videos of human demonstrations – “demonstration to real” rather than “simulation to real” learning.
However, it is important to note that in practice, RL for the real world, i.e. hardware/robotics, is still not quite there. RL has found great success in settings where the state of the problem is fully representable in software, like Atari games or board games like Go. However, generalizing to the much messier real world has proved more difficult – even the OpenAI team behind the Rubik’s cube project spent 3 months solving the problem in-simulation and then almost 2 years getting it to generalize to a real robotic hand with a real Rubik’s cube – and even then, with far less than 100% reliability. It will be interesting to see how quickly new approaches to RL can square the circle of generalizing to the real world. I had a great conversation with Kirill Polzounov and Lee Redden from Blue River about this – they presented a paper on a plugin they developed for OpenAI Gym allowing people to quickly test RL algorithms on real-world hardware. I’m excited to see how quickly “RL for the real world” progresses – if we see an inflection point like the one vision hit in 2012, the implications for robotics could be tremendous.
Natural language processing
Another area worth mentioning is NLP (natural language processing), in which I’ve done some work personally. The Transformers/transferable language model revolution is still bearing fruit, with a number of papers showing good results leveraging those techniques. I was also intrigued by a paper that claimed unprecedented long-horizon performance for memory-augmented RNNs. It will be interesting to see if the pendulum swings from “attention is all you need” back to more traditional RNN approaches. It’s also worth noting that NLP is starting to hit its stride for real-world applications. I have a few friends and acquaintances working on startups in the field, including Brian Li of Compos.ai, whom I ran into at NeurIPS. I also enjoyed peeking into the workshop on document intelligence – it turns out NLP for the legal space is already a multi-billion dollar industry! Broadly speaking, natural language is the informational connective tissue of human society, and techniques to apply computational approaches to this buzzing web of information will only grow in the future.
Another area I’ll treat briefly, from personal ignorance rather than unimportance, is that of SysML – i.e. systems for ML and ML for systems. This is an exploding field, as evidenced by the numerous papers presented at NeurIPS and the workshops in the field. One particularly interesting talk was the one Jeff Dean gave at the ML for systems workshop – definitely worth a watch if you can find a recording (please leave a comment if you do). He and his team at Google managed to train a network to lay out ASICs much more quickly than human engineers could, and met or even surpassed the performance of ASICs laid out by humans. A number of other papers also showed compelling results in optimizing everything from memory allocation to detecting defective GPUs with the help of deep learning. A number of papers also addressed the “systems for ML” direction, such as the Snorkel paper mentioned above.
Generative models have reached a stage of significant maturity and are now being used as a tool for other directions as well as being a research direction in their own right. The performance of the models themselves is now incredible, with models like BigGAN having previously established a photorealistic state of the art for vision, and I saw a number of papers yielding unbelievably good results in conditional text-to-image generation, video-to-video mapping, audio generation, and more. I’ve been thinking about a number of downstream applications of these techniques, including some in the fashion industry and visual and musical creative tools, and I’m looking forward to seeing what emerges in industry in the years to come. Applications of generative models in other fields of machine learning have also been interesting, including fields like video compression – I talked to some folks from Netflix about this, as it may prove useful for reducing the bandwidth load on the Internet from video. (Netflix and YouTube alone use something like ⅔ of the bandwidth in the U.S.) Generative models are also being used in sim2real work in robotics, as previously mentioned.
Finally, for the sake of completeness I’ll mention a few more areas which I witnessed smaller bits of. Autonomous driving is still seeing a large and heterogeneous amount of work. It seems that we’re settling into a state of incremental improvement, where both research and deployment of self-driving are going to happen in fits and starts over the next several decades (e.g. local food delivery with slow, small vehicles and truck platooning are easier problems than autonomous taxis in cities, and will likely see more commercial progress sooner). On the other hand, deep learning for medical imaging appears to be maturing as a field, with numerous refinements and applications still emerging. Finally, I was also intrigued by a paper in deep learning for mixed integer programming (MIP). Traditional “operations research” style optimization like that which can be framed as MIP problems drives tremendous economic value in industry, and it will be interesting to see if deep learning proves to be useful alongside older techniques there as well.
Modern AI/ML, largely powered by deep learning, has exploded into a large and heterogeneous field. While there is some degree of unsubstantiated hype about its possibilities, there is also plenty of genuine value to be derived from the progress of the last 7+ years, and many promising directions to be explored as the field matures. I look forward to seeing what the next decade brings, both in research and in industrial applications.
Thanks to Shengjia Zhao and Isaac Sheets for helping edit this essay.