Thoughts from Lisbon LxMLS on ML, LLMs/ChatGPT and more

LxMLS 2023 summer school (photo credit: Fernando Batista)

Lisbon will forever be associated with exploration, given its position and history. Cabo da Roca nearby is the last land you’ll see in mainland Europe before heading across the ocean, as countless Portuguese explorers have done in the past. The sea also shapes the diet: bacalhau, salted cod, is the national dish of Portugal. Perhaps less well known is the fact that Lisbon has an annual machine learning summer school, LxMLS, which ran for the thirteenth time this July. I attended LxMLS last week, which was great, and I’ll write about some of my takeaways from the summer school, and from ML and LLMs in general.

I’ve been to many conferences over the years, whether as a presenter or an attendee, and have lectured at Queen Mary University of London. However, this was the first time in a while that I was a student again! In this age where so much material is available, there is still something to be said for going on a course. Ok, I am somewhat biased given I teach part time at QMUL and also on QDC and MLI, but it really does help when you’re able to interact with a class and a teacher, and to ask questions. From a teaching perspective, it was also good to see things from the other side. Most of the attendees were PhD students, but there were also practitioners from industry like myself. As well as the lecturers, there was a team of monitors/teaching assistants.

 

Summary of course topics
The course began with a refresher on linear models in ML, taught by Mario Figueiredo, and on neural networks, taught by Bhiksha Raj. Later days were devoted to particular topics, with a focus on NLP. This included sequence models taught by Noah Smith. There was also a day on transformers by Kyunghyun Cho, which was new for this year. The day on multimodal learning was taught by Desmond Elliott. Lastly, the causality day was led by Adele Ribeiro. In the afternoons, there were labs to go over the various topics.

One thing I go over again and again with the students I teach is the importance of practising coding, and labs are still an invaluable way to learn, in the presence of other students and teaching assistants. The evenings were rounded off with shorter talks from invited speakers presenting their research. It should be noted that all the videos and slides from the lectures are available online, with links on the LxMLS website.

 

All the buzz on LLMs
Perhaps unsurprisingly, the topic of LLMs (large language models) came up repeatedly during the course, given the recent buzz around ChatGPT, which is a type of LLM. Their promise is that if we give them enough data, they can learn everything in the world (I’m somewhat paraphrasing the marketing speak here, but you get the idea!). They essentially avoid the need to create specific models for each task (e.g. one for sentiment analysis, another for topic categorization), as noted in Sara Hooker’s talk.

The problem is that whilst they might bamboozle us by doing things which seem innately human, like writing poetry, they can struggle with basic tasks. In Yejin Choi’s talk, she showed research results demonstrating that they hadn’t mastered concepts like multiplication. In a sense, we don’t need LLMs to do everything. Indeed, Desmond Elliott noted there are many people doing exciting things in ML which do not involve LLMs. I would agree with that! After all, what we’re doing at Turnleaf Analytics, which Alexander Denev and I co-founded, is forecasting inflation using ML. Of course some of the features of our time series forecasting models are derived from text using NLP, but it isn’t all about using LLMs.

 

One of the issues with using an LLM in this context is that it becomes difficult to do things point in time. Let’s say you ask an LLM to forecast inflation in the past: the model may have been trained on text written after that point, resulting in look-ahead bias. With a dedicated news data vendor (and there are many, ranging from Bloomberg to RavenPack), every article has an associated timestamp, and it is possible to ask vendors which periods their models were trained on. In our problem case of inflation, for time series forecasting, this means making sure that our training set does not encompass the whole time series (which would result in information leakage) but is instead constructed on a rolling basis.
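To make the rolling-basis idea concrete, here is a minimal sketch in plain Python of walk-forward splitting, where the training window always strictly precedes the test window. The function name and window parameters are illustrative, not anything from our actual pipeline at Turnleaf Analytics.

```python
def walk_forward_splits(n_obs, train_window, test_window):
    """Yield (train_idx, test_idx) pairs over a time series of n_obs
    points, where the training window always ends before the test
    window begins -- so the model never sees the future."""
    start = 0
    while start + train_window + test_window <= n_obs:
        train_idx = list(range(start, start + train_window))
        test_idx = list(range(start + train_window,
                              start + train_window + test_window))
        yield train_idx, test_idx
        start += test_window  # roll the windows forward

# Example: 10 observations, train on 4 points, forecast the next 2
splits = list(walk_forward_splits(10, 4, 2))
for train_idx, test_idx in splits:
    assert max(train_idx) < min(test_idx)  # no look-ahead bias
```

Evaluating a model only on splits like these, rather than on a random shuffle of the whole series, is what prevents information from the future leaking into the training set. Libraries such as scikit-learn offer a similar ready-made utility (`TimeSeriesSplit`).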

Maybe in the future it’ll become easier to train LLMs on a rolling basis, and it’ll be possible to query different vintages of the model, but to my knowledge this isn’t possible today. But maybe it’s ok that LLMs are not suitable for every problem we might throw at them, and in particular that they don’t know everything. After all, we already know how to do multiplication, and for time series, we already have a way of avoiding information leakage.

Another point about LLMs is that many problems will require more than text to solve. After all, if we think of how a human interacts with the world, it is not purely about reading text. In our case of forecasting inflation, whilst text is an input, so are many other types of variables, whether macroeconomic data, market data, other sorts of alternative data etc. Desmond Elliott’s module on multimodal learning was particularly enlightening on this point of using multiple types of data in a problem. He showed how to use images and text together to learn how to label images. He also noted that for many languages other than English, there isn’t as much data to train LLMs.

Moreover, the point about the scale of LLMs came up many times during the course, with Sara Hooker’s talk specifically covering it, comparing a number of these models. The cost of training these models has ballooned, and the amount of data they ingest is huge. The models are so large that they cost a lot to run, and some, like ChatGPT, are also proprietary, although open source alternatives are now emerging, such as Meta’s LLaMA models. There is also a limit to how much data there actually is on the web for them to ingest. A lot of data is proprietary and not freely available online, so would never be used for training such a model. What we have observed in our case of forecasting inflation at Turnleaf Analytics is that being able to curate a dataset of relevant input features, often gathered from many sources, is a crucial part of the process.

 

The pre-2022 web was different, before generative AI
Furthermore, as Desmond Elliott pointed out, the pre-2022 web is likely to be very different to the era of the web after that. With LLMs creating text and Stable Diffusion rendering images, there is the potential for a mass of machine-created web content. Indeed, you can imagine a scenario where the web becomes less a place for humans to post and more a place for machines to output text and images. How will this impact how LLMs are trained, if most of their training set is made up of machine-generated material?

Another point made about LLMs is the difficulty of understanding the sources behind any reply they give. Andrew Lampinen from Google DeepMind discussed trying to find causality in language models. He noted, for example, that Google Bard is beginning to attribute sources in its responses.

Perhaps the future will see a return to SLMs (small language models!), which are simpler and can run easily on devices like mobile phones. Of course, all this is just conjecture, but it seems unlikely that LLMs will be the last technology in this area, and indeed they have certainly not solved everything! There will be something after them! What matters most is not so much the technology, but how we as a society manage (and regulate) these models. They have the potential to impact our society in many ways, most of which I suspect we haven’t even considered.

 

What’s the best thing about LxMLS?
What was the best thing about LxMLS? It was meeting all the other folks and making friends with many of them at the event. It was very interesting to chat with other attendees about their research, which spanned many different areas of NLP.

If you get an opportunity to do a course on ML, I’d definitely recommend it. Courses can help solidify existing knowledge and also teach you about new areas. They are also great ways to network and learn what others are doing in this exciting space. Hopefully, I’ll be back at LxMLS in the future on the other side, maybe as a monitor, let’s see! I never managed to have a burger during my week at LxMLS, so that is reason enough to return to Lisbon!