import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from string import punctuation

I went to the Kickstart and Ideaspace websites and scraped the descriptions of the startups they funded.

And by scraped, I mean I cut and pasted stuff into a Google Sheets document.

raw = pd.read_csv("../files/Philippine Startups - Sheet1.csv")
descriptions = raw['Long Description']
descriptions.head()
0    Arthrologic designs and develops a TKA (Total ...
1    BluLemons Gaming Studio is an all-Filipino th...
2    Croo enables people to swiftly send informatio...
3    The Company has the opportunity to create the ...
4    Despite current transponder technologies avail...
Name: Long Description, dtype: object
raw_words = word_tokenize(" ".join(descriptions))
stop_words = set(stopwords.words('english') + list(punctuation))

words = [w.lower() for w in raw_words if w.lower() not in stop_words and not w.isdigit() and len(w) > 3]
word_str = " ".join(words)
with open("../files/phstartupwords.txt", "w") as f:
    f.write(word_str)

Lazy Wordcloud Visualization

Paste the contents of the generated file into a word cloud generator, and manually remove the words that occur fewer than three times.
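If removing the rare words by hand gets tedious, the same filter could probably be done in code before writing the file. Here is a minimal sketch using `collections.Counter`. The helper name `drop_rare_words` and the sample list are made up for illustration; in the notebook you would pass in the `words` list built earlier instead.

```python
from collections import Counter

def drop_rare_words(words, min_count=3):
    """Keep only the words that occur at least min_count times."""
    counts = Counter(words)
    return [w for w in words if counts[w] >= min_count]

# Hypothetical sample standing in for the `words` list built earlier.
sample_words = ["startup"] * 4 + ["platform"] * 3 + ["mobile"] * 2 + ["farmers"]

frequent = drop_rare_words(sample_words)
print(Counter(frequent).most_common())
# [('startup', 4), ('platform', 3)]
```

Joining `frequent` with spaces and writing that to the text file instead of `word_str` should leave nothing to prune by hand.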


