I often find it hard to decide where I want to travel next. I don’t want to go to places I already know, and the places I don’t know – well, I don’t know them. It seems best to just trust what my friends recommend. But I still often think: Can’t I solve this problem with some data and machine learning?
Solve it with data, data, data – not
So my first idea was: I’ll just collect a ton of data on different cities:
- Get data from Numbeo and other websites like it;
- Do some topic modelling on Wikipedia entries and city guides;
- Then use clustering algorithms to group similar cities together.
So the cities grouped in the same clusters as the cities I already like are the cities I should travel to next.
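A minimal sketch of what that clustering step could look like, in plain Python. The feature values below are made up for illustration (they are not real Numbeo data), and `CITY_FEATURES`, `min_max_scale`, `k_means`, and `similar_cities` are hypothetical names; a real attempt would more likely lean on a library such as scikit-learn.

```python
import math

# Hypothetical features per city: (cost index, average temperature in C, nightlife score).
# Illustrative numbers only, not real data.
CITY_FEATURES = {
    "Berlin": (60.0, 10.0, 9.0),
    "Prague": (50.0, 9.0, 8.0),
    "Bangkok": (40.0, 29.0, 8.0),
    "Chiang Mai": (30.0, 27.0, 5.0),
}


def min_max_scale(rows):
    """Rescale every feature column to the range 0..1 so no feature dominates."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        tuple((v - lo) / (hi - lo) if hi > lo else 0.0
              for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


def k_means(points, k, iterations=20):
    """Tiny k-means with a deterministic start (first point, then farthest points)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda i: math.dist(p, centroids[i]))
                  for p in points]
        # Move each centroid to the mean of its members.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:
                centroids[i] = tuple(sum(axis) / len(members)
                                     for axis in zip(*members))
    return labels


def similar_cities(city, features, k=2):
    """Return the other cities that land in the same cluster as `city`."""
    names = list(features)
    labels = k_means(min_max_scale([features[name] for name in names]), k)
    target = labels[names.index(city)]
    return [name for name, label in zip(names, labels)
            if label == target and name != city]


print(similar_cities("Berlin", CITY_FEATURES))  # ['Prague']
```

By construction, the only cities this can return are ones whose numbers resemble a city already on the list.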
But I had a gut feeling that this plan wouldn’t work out. Why?
- The data is not human enough. What I’m interested in is what a city will feel like to me. Is that kind of information really in quantitative city data?
- When I travel, I don’t look for cities that are just like the cities I went to in the past. A similarity search probably won’t find the surprising recommendations that I’m looking for.
So I put the idea aside for a while.
The missing piece
Last month, I decided to go to Cape Town for a few weeks. It wasn’t easy to pick a destination, and during my research I often went to NomadList to get some inspiration. And I noticed something I hadn’t seen before: The members on NomadList have public profiles. And the public profiles include their travel histories.
Actual detailed travel histories for thousands of digital nomads!!
This I can use! It’s exactly the kind of data I need. I can find the Nomads whose taste is most similar to mine and simply look at which cities they went to.
But doing all that by hand is a bit of a stretch – I don’t want to manually look through thousands of Nomads. So I’ll use a collaborative filtering algorithm.
A quick recap on recommenders
The 2 basic types of recommendation algorithms
A recommender is the kind of software that Netflix or Spotify’s Discover Weekly uses to give you personalised recommendations. It learns from what you’ve clicked, liked, watched, or listened to in the past, and then recommends things that fit your taste.
There are many different kinds of recommendation algorithms, but most of them are built on these two basic types:
- Algorithms that use data about the thing you want recommendations for: What other cities have a similar data signature to the cities I like? This is how I first looked at it, but I was afraid that approach would just produce obvious, boring recommendations.
- Algorithms based on what other people liked: Based on the cities other people with similar taste have visited, what cities would I like?
The first kind is called a content-based recommender, and the second is called collaborative filtering.
The beauty of collaborative filtering
Collaborative filtering is so elegant! Why?
- It’s simple. The collaborative filtering algorithm only needs one input: the places other people liked. It doesn’t need any data at all about the cities themselves.
- It uses the knowledge implied in the travel histories. Each travel history is the outcome of a string of decisions made by a person. They used their experiences, advice from friends, data from research, and intuition to come to these decisions.
- This is why a collaborative filtering algorithm can give recommendations that are both surprising and accurate. Those are the kinds of recommendations I’m looking for – cities that are me, even though they might not be similar to a place I’ve been before in any obvious way.
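To illustrate the first point: the whole input can be reduced to a table of who visited what. The sketch below builds that binary member-by-city table from travel histories. The member names and histories are hypothetical, and `build_interaction_matrix` is an illustrative helper, not part of any library.

```python
# Hypothetical travel histories: member name -> cities visited.
EXAMPLE_HISTORIES = {
    "nomad_a": ["Berlin", "Lisbon"],
    "nomad_b": ["Lisbon", "Cape Town"],
}


def build_interaction_matrix(histories):
    """Turn travel histories into a 0/1 table: one row per member, one column per city."""
    cities = sorted({city for visited in histories.values() for city in visited})
    column = {city: index for index, city in enumerate(cities)}
    members = sorted(histories)
    matrix = [[0] * len(cities) for _ in members]
    for row, member in enumerate(members):
        for city in histories[member]:
            matrix[row][column[city]] = 1  # 1 means "this member has been there"
    return members, cities, matrix


members, cities, matrix = build_interaction_matrix(EXAMPLE_HISTORIES)
print(cities)  # ['Berlin', 'Cape Town', 'Lisbon']
print(matrix)  # [[1, 0, 1], [0, 1, 1]]
```

Nothing in this table describes the cities themselves: no prices, no weather, no guidebook text.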
Getting those travel histories
So I simply went through all the pages on NomadList. Whenever my software encountered a member page, it saved their travel history.
I can’t guarantee I got all the member pages, but I think I did. I found 3,640.
Looks good so far!
A quick look at the data
There are 1,152 members who don’t have any trips on their member page. Including them won’t help our recommender, so I removed them.
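That cleaning step can be sketched in a few lines. The `RAW_HISTORIES` dictionary and the `summarise` helper are hypothetical stand-ins for the scraped data, and the sketch assumes that every entry in a member's history counts as one trip.

```python
# Hypothetical scraped data: member name -> list of trips (one city per trip).
RAW_HISTORIES = {
    "nomad_a": ["Berlin", "Lisbon", "Berlin"],
    "nomad_b": [],  # no trips on the member page
    "nomad_c": ["Lisbon"],
}


def summarise(histories):
    """Drop members without trips, then count trips and unique cities."""
    active = {member: trips for member, trips in histories.items() if trips}
    total_trips = sum(len(trips) for trips in active.values())
    unique_cities = {city for trips in active.values() for city in trips}
    return active, total_trips, len(unique_cities)


active, total_trips, city_count = summarise(RAW_HISTORIES)
print(len(active), total_trips, city_count)  # 2 4 2
```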
I now have 2,488 Nomads remaining, with a total of 36,822 trips recorded. And there are 4,247 unique cities included in those trips.
This is what the data looks like now:
Let’s make some recommendations!
Now let’s train a collaborative filtering algorithm on the data. I’m using LightFM – it’s a powerful recommender library in Python.
Running the algorithm and training a model takes about 10 seconds. Now I have a recommender that’s ready to make some recommendations!
Let’s try it!
First I’m going to write down a few cities I like, say Berlin, Nuremberg, Barcelona, and Cape Town.
- Berlin, Germany
- Cape Town, South Africa
- Barcelona, Spain
- Nuremberg, Germany
I give this data as input to the recommender, and then it calculates how much it thinks I will like each of the 4,247 cities in the dataset.
How does the recommender make recommendations?
- It calculates the similarity between my taste and each of the NomadList members in the dataset, based on the cities we each traveled to.
- By paying more attention to the members who have similar taste and less attention to those with different taste, it makes a list of cities I might like.
- It gives each city a score that sums up how often that city would have been recommended by the Nomads with similar taste to mine.
It’s as if I just asked two thousand friends for advice and then pooled their answers, paying more attention to advice from friends who share my taste.
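The three steps above can be sketched as a small neighbourhood-style recommender in plain Python. This is a simplified illustration of the idea, not how LightFM works internally (LightFM learns embeddings for users and items rather than comparing members directly). The member names, the histories, and the `cosine_similarity` and `recommend` helpers are all hypothetical.

```python
import math

# Hypothetical travel histories: member name -> set of cities visited.
NOMAD_HISTORIES = {
    "nomad_a": {"Berlin", "Barcelona", "Lisbon"},
    "nomad_b": {"Berlin", "Cape Town", "Lisbon", "Medellin"},
    "nomad_c": {"Tokyo", "Seoul"},
}


def cosine_similarity(a, b):
    """Cosine similarity between two sets of cities, treated as 0/1 vectors."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def recommend(my_cities, histories, top_n=10):
    """Score unvisited cities by summing the similarity of every member who went there."""
    scores = {}
    for visited in histories.values():
        weight = cosine_similarity(my_cities, visited)  # step 1: how similar is our taste?
        if weight == 0.0:
            continue  # members with no overlap get no say
        for city in visited - my_cities:  # step 2: only cities I haven't listed
            scores[city] = scores.get(city, 0.0) + weight  # step 3: weighted vote
    # Highest score first; ties are broken alphabetically to keep the order stable.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


my_cities = {"Berlin", "Nuremberg", "Barcelona", "Cape Town"}
print([city for city, _ in recommend(my_cities, NOMAD_HISTORIES)])  # ['Lisbon', 'Medellin']
```

In this toy data, Lisbon wins because both similar members went there, while Tokyo and Seoul never show up because the only member who visited them shares no cities with me.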
So what did the recommender recommend to me? Here are the top 10 cities:
Not bad! But we’re not done. A single test run is not enough. Next, we need to make sure our recommendations are actually good. Stay tuned for Part 2: Improving the Recommendations.
What do we know about recommenders?
Our machine learning team at Data Revenue recently spent a lot of time getting to know recommendation engines. We built:
- An elegant tactic recommender for Ladder.io – they even published a blog post on the project;
- A huge movie recommender for the largest media conglomerate in Europe;
- A hotel recommender for a big German travel site.
We know quite a bit about recommenders now – how to build them, scale them, and use them in production. If you have any questions, don’t hesitate! Write to me at m.schmitt [at] datarevenue.de