99designs Tech Blog

Adventures in web development

Swiftly and Machine Learning: Part 1

by Daniel Williams

In this series of guest blog posts, 99designs intern Daniel Williams takes us through how he has applied his knowledge of Machine Learning to the problem of classifying Swiftly tasks.

Introduction

Swiftly is an online service from 99designs that lets customers get small graphic design jobs done quickly and affordably. It’s powered by a global network of professional designers who tackle things like business card updates and photo retouching in 30 minutes or less – an amazing turnaround time for a service with real people in the loop!

Given that we have a pool of designers waiting for customer work, how can we best allocate them tasks? Currently we take a naive but fair approach: assign each new task to the designer that has been waiting in the queue the longest. But there’s room for improvement: designers excel at different types of tasks, so ideally we’d match tasks to designers based on expertise. To do this we need to be able to categorise tasks by the skills they require.

In today’s approach, we’ll try to solve the problem with machine learning. The first step is to find a way to automatically categorise a design brief, with categories forming our “areas of expertise”. The next will be figuring out what categories a particular designer is good at. If we can build solid methods for both of these steps, we can begin matching designers to tasks.

In this post, I’ll introduce the problem and walk through some attempts at applying unsupervised techniques for discovering task categories. Follow along, and you may recognise a similar situation of your own that you can apply these methods to.

Swiftly tasks

Swiftly tasks are meant to be quick to fire off and highly flexible. The customer fills in a short text box saying what they want done, uploads an image or two, and then waits for the result. This type of description, plain text and raw images, is highly unstructured. Since image recognition and indexing is its own hard problem, we’ll skip the images for now and focus on the text.

Here’s a couple of examples:

Task A

More Handsome

  1. Remove the man’s glasses.

  2. Make the man’s face MORE HANDSOME.

Task B

In my logo, there is a “virtual” flight path of an airplane. I have had comments that the virtual flight path goes into the middle of the Pacific Ocean for no reason - not a logical graphic. I want you to “straighten” out the flight path - as shown on the Blue lines in the attached PDF titled “Modified_Logo.PDF.” I still want the flight path lines to be in white, with black triangles separating the segments. I just want the segments to be straighter and not go over the ocean as in the original. Please contact me for any clarification. I am uploading the EPS and AI files as well to make the change. Thank you!

How might a human classify these tasks? I would probably classify the first as “image manipulation” and the second as “logo refresh,” although the second could just as easily be “image manipulation” as well. Already you can see that classifying these sorts of tasks into concrete categories is perhaps going to be more art than science.

Figuring out the categories

The first major problem is deciding on a sensible set of categories. This has turned out to be more difficult than I first imagined. Customers use Swiftly for a wide range of tasks. Plus, there’s quite a bit of overlap — one Swiftly task is sometimes a combination of multiple small tasks. My initial approach, just to get a feel for the data, was to eyeball 100 task briefs and attempt to invent categories and classify them manually. The result of this process:

Category                   Number of tasks
Logo Refresh (Holidays)    34
Logo Refresh               11
Copy Change                11
Vectorise                  13
Resize/Reformat            17
Transparency               1
Image Manipulation         10
Too hard to classify       3

A large number of the instances were hard to classify, even for a human! I was not 100% happy with the categories that I came up with, as many tasks did not fit comfortably in the buckets. I decided to apply some unsupervised machine learning techniques in an attempt to cluster design briefs into logical groups. Can a machine do better?

Unsupervised clustering

I explored software called gensim, an unsupervised natural language processing and topic modelling library for Python. Gensim comes equipped with various powerful topic modeling algorithms, which are capable of extracting a pre-specified number of topics and associating words with those topics. It also helps with converting a corpus of documents into various formats (e.g. vector space model). The main algorithm that I made use of is called Latent Dirichlet Allocation. The first step is converting the text corpus into a model that allows for the application of mathematical operations.

The vector space model

To apply mathematical algorithms to natural language, we need to convert language into a mathematical format. I used a simple model known as the bag-of-words vector space model. This model represents each document as a vector, where each dimension of the vector corresponds to a different word. The value of a word in a particular document is just the number of times it appeared in that particular document. The vector will have n dimensions, where n is the total number of unique terms in the whole collection of documents. Let’s try an example.

Say we have the following collection of documents:

  1. The monster looked like a very large bird.
  2. The large bird laid very large eggs.
  3. The monster’s name was “eggs.”

After finding all the unique words (“the,” “monster”, etc.) and assigning them an index in the vector, we can count those words in each document to turn each document into a word frequency vector:

  1. (1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0)
  2. (1, 0, 0, 0, 0, 1, 2, 1, 1, 1, 0, 0)
  3. (1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1)
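The counting above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the helper names `build_vocabulary` and `to_vector` are made up for illustration, and the documents have been lower-cased and stripped of punctuation by hand (with "monster's" simplified to "monster") so that the indices line up with the vectors shown.

```python
def build_vocabulary(documents):
    """Map each unique token to a vector index, in order of first appearance."""
    vocabulary = {}
    for tokens in documents:
        for token in tokens:
            if token not in vocabulary:
                vocabulary[token] = len(vocabulary)
    return vocabulary


def to_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one document."""
    vector = [0] * len(vocabulary)
    for token in tokens:
        vector[vocabulary[token]] += 1
    return vector


# The three example documents, tidied by hand (an assumption of this sketch).
documents = [
    "the monster looked like a very large bird".split(),
    "the large bird laid very large eggs".split(),
    "the monster name was eggs".split(),
]

vocabulary = build_vocabulary(documents)
vectors = [to_vector(tokens, vocabulary) for tokens in documents]
for vector in vectors:
    print(vector)
```

Each printed list has twelve entries, one per unique word, and most of them are zero. That sparsity only grows with a real corpus, which is why libraries tend to store these vectors in a sparse format.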

Corpus pre-processing

If you just split your text into words on whitespace and apply this naively, the results can be messy. On the one hand text contains punctuation we want to ignore. On the other, this is going to work best when we have lots of words in common between the documents. Do we really want to treat “Egg”, “egg” and “eggs” as different words? To get the best results, you deal with these kinds of problems in a pre-processing step.

In our pre-processing, we:

  1. Split the document description into individual tokens (i.e. words)
  2. Put tokens into lower case
  3. Remove punctuation from start and end of tokens
  4. Remove stop words (e.g. “and”, “but”, “the”, …)
  5. Perform stemming

Stemming is the process where words are reduced to their “stem” or root format, basically chopping any variation off their end. For example, the words “stemmer,” “stemming” and “stemmed” would all be reduced to just “stem”. I used the nltk implementation of the snowball stemmer to perform this step. All of these steps can be performed very easily in Python:

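import nltk

# characters stripped from both ends of each token
# (a raw string keeps the backslash literal)
PUNCTUATION = r"""!@#$%^&*()_+=][{}'-";:/?\.,~`"""

def tidy_text(task_description):
    """Does the following:
    1. Tokenises words
    2. Lower-cases tokens and removes surrounding punctuation
    3. Removes stop words
    4. Puts words through the snowball stemmer"""

    stemmer = nltk.stem.snowball.EnglishStemmer()
    # needs the NLTK stop word corpus: nltk.download('stopwords')
    stopwords = set(nltk.corpus.stopwords.words('english'))

    outwords = []
    for word in task_description.split():
        word = word.strip(PUNCTUATION).lower()
        # skip stop words and tokens that were nothing but punctuation
        if word and word not in stopwords:
            outwords.append(stemmer.stem(word))

    return outwords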

Running our earlier bird examples through this function, we get:

['monster', 'look', 'like', 'larg', 'bird']
['larg', 'bird', 'laid', 'larg', 'egg']
['monster', 'name', 'egg']

This process reduces the noise in the vector space model, because tokens that mean the same thing are assigned the same token (through stemming and punctuation and caps normalisation) and words that probably do not add any meaning are removed (through stop word removal). Eventually, I expect the pre-processing steps to become much more in-depth, but for now this should get us started.

Latent Dirichlet Allocation (LDA)

LDA is an algorithm developed to automatically discover topics contained within a text corpus. It is a generative probabilistic model that uses Bayesian inference to estimate the probability that each document in the corpus belongs to each topic. Gensim uses an “online” implementation of LDA, which means that it breaks the documents into chunks and regularly updates the LDA model (as opposed to a batch implementation, which processes the whole corpus at once). Importantly, the number of topics must be supplied in advance. Since I did not know how many topics might exist, I decided to apply LDA with varying numbers of topics. For example, if we ran LDA with 6 topics, the result for a single document might look like this:

[(0, 0.0208), (1, 0.549), (2, 0.0208), (3, 0.366), (4, 0.0208), (5, 0.0208)]

This means LDA places that document 2% in topic 0, 55% in topic 1, 2% in topic 2 and so on. For the simple analysis I am doing, I just want the best guess topic. We can convert the result from probabilistic to deterministic by just picking the best guess.

max(x, key=lambda lda_result:lda_result[1])
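To make that one-liner concrete, here is a small self-contained sketch that applies it to the example result shown earlier. The function names are made up for illustration, and `confident_topic` with its `threshold` parameter is a hypothetical extension rather than something used in this analysis: it refuses to guess when no topic is clearly dominant.

```python
# (topic id, probability) pairs for one document, copied from the example above.
lda_result = [(0, 0.0208), (1, 0.549), (2, 0.0208),
              (3, 0.366), (4, 0.0208), (5, 0.0208)]


def best_topic(topic_probabilities):
    """Return the (topic id, probability) pair with the highest probability."""
    return max(topic_probabilities, key=lambda pair: pair[1])


def confident_topic(topic_probabilities, threshold=0.5):
    """Hypothetical variant: return the best topic id, or None if LDA is unsure."""
    topic_id, probability = best_topic(topic_probabilities)
    return topic_id if probability >= threshold else None


print(best_topic(lda_result))
print(confident_topic(lda_result))
print(confident_topic(lda_result, threshold=0.6))
```

Picking only the maximum throws away the runner-up (topic 3 at roughly 37% here), which may matter for briefs that genuinely combine two kinds of work.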

Much of my approach in the following sections is based on the LDA guides written by gensim’s author.

Pre-processing for LDA

I extracted ~4400 job descriptions from the Swiftly database. I removed the formatting from each, and applied the pre-processing steps described above (tokenisation, stemming, stop word removal etc.). The result was a plain text file, with each pre-processed Swiftly job on a new line, like this:

within attach illustr file top left window white background we’d like follow item creat use 2 version complet logo also word there vertic version horizont version 1 creat version taglin get organ 2 logo 2 put 4 logo 2 taglin 2 without transpar background

need make titl look better take text top adjust remain element offic los angel mayor eric garcetti partnership ucla labor center rosenberg foundat california communiti foundat california endow asian americans/pacif island philanthropi cordial invit close recept also add hashtag bottom descriptor dreamsummer13 take rsvp august 19 2013

need logo revamp want logo look great monogram ex chanel gucci lv etc logo consist letter r&b want classi font letter either back back intertwin ex roll royc logo ysl gucci lv etc

tri new font similar attach chang colour solid blue rather way edg fade white/light blue look font use www.tradegecko.com logo style font look

want logo tag line made bigger line logo origin cut close caus distort need logo deliv format includ transpar also imag clear need enhanc

remov man glass make man face handsom

I then used the gensim tools to create the vector model required for LDA. On the recommendation of the gensim authors, I also removed all tokens that only appeared once. The doc2bow function used in the MyCorpus class below converts the document into the vector space format discussed above.

from gensim import corpora, models, similarities

# pre-processed swiftly jobs, each job on a new line
CORPUS = "StemmedStoppedCorpus.txt"

class MyCorpus(object):
    """Streams the corpus file, yielding one bag-of-words vector per job."""
    def __iter__(self):
        # relies on the module-level `dictionary` created below
        for line in open(CORPUS):
            yield dictionary.doc2bow(line.split())

# create dictionary mapping between text and ids
dictionary = corpora.Dictionary(line.split() for line in open(CORPUS))

# find words that appear in only one document in the entire doc set
once_ids = [tokenid for tokenid, docfreq in dictionary.dfs.items() if docfreq == 1]

# remove once words
dictionary.filter_tokens(once_ids)

# "compactify" - removes gaps in ID mapping created by removing the once words
dictionary.compactify()

# save dictionary to file for future use
dictionary.save("swiftly_corpus.dict")

# create a corpus object
swiftly_corpus = MyCorpus()

# store to disk, for later use
corpora.MmCorpus.serialize("swiftly_corpus.mm", swiftly_corpus)

In the code above, the .mm file uses the Matrix Market format, which stores a sparse matrix (here, one sparse vector per document). The dictionary file simply maps the integer word ids used in the MM file to the actual word each id represents.
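To give a feel for what a sparse vector and an id mapping look like, here is a standard-library sketch that mimics the idea on the three stemmed example documents from earlier. It is a stand-in for illustration, not gensim's implementation: the function names are invented, and assigning ids in alphabetical order is an assumption of this sketch, so gensim may number the same tokens differently.

```python
from collections import Counter


def document_frequencies(documents):
    """Count, for each token, how many documents it appears in."""
    frequencies = Counter()
    for tokens in documents:
        frequencies.update(set(tokens))
    return frequencies


def build_token_ids(documents, min_doc_freq=2):
    """Assign ids to tokens, dropping those seen in fewer than min_doc_freq documents."""
    frequencies = document_frequencies(documents)
    kept = sorted(token for token, count in frequencies.items()
                  if count >= min_doc_freq)
    return {token: token_id for token_id, token in enumerate(kept)}


def to_sparse_bow(tokens, token_ids):
    """Return sorted (token id, count) pairs, listing only tokens that occur."""
    counts = Counter(token for token in tokens if token in token_ids)
    return sorted((token_ids[token], count) for token, count in counts.items())


documents = [
    ['monster', 'look', 'like', 'larg', 'bird'],
    ['larg', 'bird', 'laid', 'larg', 'egg'],
    ['monster', 'name', 'egg'],
]

token_ids = build_token_ids(documents)
sparse_vectors = [to_sparse_bow(tokens, token_ids) for tokens in documents]
print(token_ids)
for vector in sparse_vectors:
    print(vector)
```

Only the non-zero counts are stored, so a brief that uses a dozen distinct words costs a dozen pairs no matter how large the overall vocabulary grows.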

Applying LDA

Now that the corpus has been stored as a matrix of vectors, we can apply the LDA model and start clustering the Swiftly jobs. This is done with the following lines of code. We can generate different models by changing the num_topics argument in the ldamodel.LdaModel() function.

import logging, gensim

# pre-processed swiftly data files
DICTIONARY = "swiftly_corpus.dict"
MM_FILE = "swiftly_corpus.mm"

# the number of topics to create
N_TOPICS = 6

# set up logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# load mapping dictionary
id2word = gensim.corpora.Dictionary.load(DICTIONARY)

# load Matrix Market file
mm = gensim.corpora.MmCorpus(MM_FILE)

# create the lda model.  Use Chunks of 500 documents, update model once per chunk analysis, and repeat 3 times.
lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=N_TOPICS, update_every=1, chunksize=500, passes=3)

# save the results
lda.save("swiftly_lda{0}_model.lda".format(N_TOPICS))

LDA Results

We can use gensim’s lda.show_topics() method to get a sense of the different clusters that LDA has picked out.

print("LDA where K = {0}\n".format(N_TOPICS))
# note: newer gensim releases name these arguments num_topics and num_words
for count, topic in enumerate(lda.show_topics(topics=-N_TOPICS, topn=20, log=False, formatted=True)):
    print("TOPIC {0}: {1}\n".format(count, topic))

Where N_TOPICS = 6, the results are:

LDA where K = 6

TOPIC 0: 0.033*logo + 0.028*holiday + 0.025*busi + 0.023*name + 0.021*card + 0.020*chang + 0.019*follow + 0.013*christma + 0.011*incorpor + 0.009*font + 0.009*compani + 0.008*attach + 0.008*line + 0.008*text + 0.007*need + 0.007*replac + 0.007*like + 0.007*2 + 0.006*1 + 0.006*would

TOPIC 1: 0.032*background + 0.029*like + 0.020*logo + 0.020*imag + 0.018*white + 0.018*make + 0.018*need + 0.018*would + 0.017*look + 0.016*color + 0.013*transpar + 0.013*font + 0.012*text + 0.012*black + 0.012*use + 0.010*chang + 0.010*want + 0.010*word + 0.010*one + 0.009*also

TOPIC 2: 0.049*logo + 0.042*file + 0.032*exist + 0.032*creativ + 0.029*element + 0.028*fun + 0.028*etc + 0.025*take + 0.024*need + 0.020*add + 0.020*vector + 0.020*festiv + 0.017*snowflak + 0.015*tree + 0.013*attach + 0.013*use + 0.011*ai + 0.010*snow + 0.010*ep + 0.009*convert

TOPIC 3: 0.031*logo + 0.023*need + 0.020*file + 0.016*attach + 0.014*like + 0.014*look + 0.014*color + 0.013*make + 0.012*use + 0.010*imag + 0.010*size + 0.008*would + 0.008*design + 0.008*2 + 0.008*want + 0.008*halloween + 0.007*format + 0.007*version + 0.007*creat + 0.007*chang

TOPIC 4: 0.040*x + 0.031*imag + 0.024*cover + 0.020*px + 0.018*photo + 0.018*app + 0.013*pictur + 0.012*need + 0.011*size + 0.010*book + 0.009*icon + 0.009*screen + 0.008*googl + 0.008*73 + 0.008*2 + 0.007*attach + 0.007*suppli + 0.007*psd + 0.006*back + 0.006*chang

TOPIC 5: 0.055*celebr + 0.049*add + 0.029*decor + 0.027*logo + 0.024*take + 0.017*banner + 0.015*facebook + 0.014*bottom + 0.014*pumpkin + 0.013*profil + 0.012*bat + 0.012*spooki + 0.012*skeleton + 0.011*side + 0.010*right + 0.009*text + 0.009*say + 0.008*pictur + 0.008*element + 0.008*etc

The number before each token is the weight LDA gives that token within the topic, which indicates how discriminating the token is for the category. Ideally, by eyeballing the discriminating tokens for a topic we could understand and identify it, giving it a useful name. As you can see, this proved to be difficult. I suspected that there are probably more than six unique categories of tasks on Swiftly, so I ran LDA with N_TOPICS set to different numbers. With 15 (this time just the top 10 words, without numbers, formatted into a table for easier comprehension), the results are:

TOPIC 1:  imag, file, pictur, like, line, high, resolut, photoshop, layer, hand
TOPIC 2:  need, imag, attach, size, file, word, 2, make, logo, 1
TOPIC 3:  element, exist, etc, logo, icon, halloween, app, add, like, theme
TOPIC 4:  tree, snow, santa, thanksgiv, leav, gold, outlin, make, fall, turkey
TOPIC 5:  yellow, use, view, new, servic, replac, team, super, feel, color
TOPIC 6:  creativ, take, add, fun, logo, pumpkin, spooki, bat, skeleton, offer
TOPIC 7:  celebr, logo, decor, make, etc, word, possibl, add, text, bit
TOPIC 8:  file, background, logo, holiday, need, vector, transpar, white, png, ai
TOPIC 9:  like, look, snowflak, logo, would, color, want, make, someth, font
TOPIC 10: chang, color, blue, code, red, font, dark, green, match, panton
TOPIC 11: festiv, pdf, send, file, need, back, page, digit, psd, version
TOPIC 12: logo, card, christma, use, font, attach, like, creat, file, busi
TOPIC 13: need, attach, page, imag, text, websit, px, pictur, use, photo
TOPIC 14: name, follow, busi, logo, chang, incorpor, compani, card, line, replac
TOPIC 15: x, cover, photo, like, would, look, suppli, 73, websit, templat

At this point, I realised that more pre-processing would be required to get this right. For instance, it seemed strange that in topic 15 the most discriminating word is ‘x’. Looking closer, I realised that this is because topic 15 represents resize / reformatting job briefs. The ‘x’ gets picked out because a large number of customers are specifying dimensions (e.g. 200px x 500px). I was also surprised to find out that ‘73’ was so discriminating, but a little bit of digging revealed that a Twitter profile picture is 73x73 pixels. To address this problem, I plan to use a preprocessing step called Lemmatisation.

I’m using the term loosely here: strictly speaking, lemmatisation reduces a word to its dictionary form, but the same idea is useful for grouping things like numbers, colours, URLs, email addresses and image dimensions together so that different values are treated equally. For example, if there is a specific colour mentioned in a brief, we don’t really care what the specific colour is; we just care that the brief mentions a colour. In our case, we believe that a brief containing a colour (e.g. #FF00FF) or image dimensions (e.g. 400x300) might give us clues about what type of task it is, so we convert anything that looks like these to the tokens $COLOUR and $DIM.
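A rough sketch of how that substitution might look with regular expressions is below. The two patterns and the `normalise_text` helper are assumptions made for illustration: they only cover hex colour codes and width-by-height dimensions, and real briefs would almost certainly need broader patterns (named colours, URLs, email addresses and so on).

```python
import re

# Hypothetical patterns, covering only the two cases mentioned above.
# Hex colours such as #FF00FF or #f0f.
HEX_COLOUR = re.compile(r"#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b")
# Dimensions such as 400x300, 73x73 or "200px x 500px".
DIMENSION = re.compile(r"\b\d+(?:\s*px)?\s*x\s*\d+(?:\s*px)?\b", re.IGNORECASE)


def normalise_text(text):
    """Replace colour codes and image dimensions with placeholder tokens."""
    text = HEX_COLOUR.sub("$COLOUR", text)
    text = DIMENSION.sub("$DIM", text)
    return text


print(normalise_text("Change #FF00FF to blue and resize to 400x300"))
print(normalise_text("banner at 200px x 500px please"))
print(normalise_text("twitter avatar 73x73"))
```

This works on the raw text rather than on tokens, because a dimension like "200px x 500px" would otherwise be split into three separate tokens. It would also need to run before the tidy-up step described earlier, and the placeholders would need to be exempted from punctuation stripping there, since both `#` and `$` are in the PUNCTUATION string.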

Despite the shortcomings of my pre-processing, this clustering task has picked out some interesting topics! Some, as is probably inevitable, are “junk topics”. Further, seasonal words seem to appear in lots of topics, which is a strange result. Despite this, many of the topics are classifiable. Topic 5 was interesting, where ‘yellow’ was such a discriminating term. A very quick (and non-scientific) review of the data suggests that people often do not like the colour yellow (I agree with them!) and want it changed. An attempt to name the topics from the table above:

  • Topic 1: Change an image so it’s in higher resolution
  • Topic 3: Change or create a logo or icon, perhaps for a smartphone app
  • Topic 4: Edits of a seasonal nature (Christmas, Thanksgiving)
  • Topic 5: Replace yellow (?!)
  • Topic 6: Halloween edits
  • Topic 8: Vectorisation task, e.g. “take this png file, turn it into a vector on a transparent background”
  • Topic 10: Change a colour in some way, often a font. “Panton” is a stemmed form of “pantone”, a popular colour chart
  • Topic 14: Change copy or update information on a business card
  • Topic 15: Resize or reformat a photo, often for social media purposes

Having to provide the number of topics to LDA, before you even know what’s reasonable, feels like a chicken-and-egg problem. It’s possible to try different numbers of topics and eyeball the results, but at times it felt a bit too much like guesswork. Nevertheless, I view these results as a decent “proof of concept”. It’s reassuring that a computer can find categories like this, and suggests that with more tweaking and a nicely labelled dataset, the job of automatically classifying Swiftly task briefs is entirely possible!

Next time…

That wraps up my experiments with unsupervised classification for this post. Next time, I plan to discuss my efforts after I settle on the Swiftly categories. I’d like to develop a nice labelled training data set (most likely using Amazon’s Mechanical Turk service), and then experiment with supervised machine learning techniques. I will also detail my efforts at developing a more sophisticated pre-processing procedure. Tune in!

About Daniel

Daniel Williams is a Bachelor of Science (Computing and Software Science) student at the University of Melbourne and Research Assistant at the Centre for Neural Engineering where he applies Machine Learning techniques to the search for genetic indicators of Schizophrenia. He also serves as a tutor at the Department of Computing and Information Systems. Daniel was one of four students selected to take part in the inaugural round of Tin Alley Beta summer internships. Daniel is an avid eurogamer, follower of “the cricket”, and hearty enjoyer of the pub.