N-grams are contiguous sequences of n items in a sentence. N can be 1, 2 or any other positive integer, although we usually do not consider very large N because such n-grams rarely appear in many different places. When performing machine learning tasks related to natural language processing, we usually need to generate n-grams from input sentences. For example, in text classification tasks, in addition to using each individual token found in the corpus, we may want to add bi-grams or tri-grams as features to represent our documents.
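As a minimal sketch of the idea, the snippet below generates n-grams from a list of tokens using only the Python standard library. The function name `generate_ngrams` and the whitespace tokenization are assumptions made for illustration, not necessarily the approach taken in the full article.

```python
def generate_ngrams(tokens, n):
    """Return the list of contiguous n-grams (as tuples) in a token list."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # zip over n shifted views of the token list; zip stops at the shortest
    # view, so only complete n-grams are produced
    return list(zip(*(tokens[i:] for i in range(n))))


# Naive whitespace tokenization, assumed here for simplicity
tokens = "natural language processing is fun".split()

bigrams = generate_ngrams(tokens, 2)
trigrams = generate_ngrams(tokens, 3)

print(bigrams)
print(trigrams)
```

A sentence with fewer than n tokens simply yields an empty list, so the function can be applied to short inputs without special handling.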
PyCon HK 2017 was held on 4-5 November 2017 at the City University of Hong Kong. I gave a talk on using the LightGBM library to build gradient boosting models. The slides of the talk can be found at http://talks.albertauyeung.com/pycon2017-gradient-boosting and the video of the talk can be found on YouTube at https://www.youtube.com/watch?v=Wjev_fLNeOU.
I gave a talk on deep learning and its applications in a research seminar at the Deep Learning Research & Application Centre (DLC), Hang Seng Management College on 20th July, 2017. The slides of the talk can be found at the link below: http://talks.albertauyeung.com/deep-learning
pandas is one of the most commonly used Python libraries in data analysis and machine learning. It is versatile and can be used to handle many different types of data. Before feeding a model with training data, one would most probably pre-process the data and perform feature extraction on data stored as a pandas DataFrame. I have been using pandas extensively in my work, and have recently discovered that the time required to manipulate data stored in a DataFrame can vary hugely depending on the method you use.
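To illustrate the kind of difference meant here, the sketch below computes the same column sum in two ways: a row-by-row loop with `iterrows()` and a vectorized column operation. The DataFrame, its column names `a` and `b`, and the two helper functions are hypothetical examples, and the measured times will depend on the machine and the pandas version.

```python
import timeit

import pandas as pd

# Small illustrative DataFrame; the column names are arbitrary
df = pd.DataFrame({"a": range(1000), "b": range(1000, 2000)})


def sum_with_iterrows(frame):
    # Row-by-row loop: every row is materialized as a Series object
    return [row["a"] + row["b"] for _, row in frame.iterrows()]


def sum_vectorized(frame):
    # Column-wise arithmetic: the addition runs over whole columns at once
    return (frame["a"] + frame["b"]).tolist()


loop_time = timeit.timeit(lambda: sum_with_iterrows(df), number=3)
vector_time = timeit.timeit(lambda: sum_vectorized(df), number=3)
print(f"iterrows: {loop_time:.4f}s, vectorized: {vector_time:.4f}s")
```

Both functions return the same values; only the way the DataFrame is traversed differs.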
Sequence Labelling in NLP
In natural language processing, it is a common task to extract words or phrases of particular types from a given sentence or paragraph. For example, when analysing a corpus of news articles, we may want to know which countries are mentioned in the articles, and how many articles are related to each of these countries. This is a special case of sequence labelling in NLP (others include POS tagging and chunking), in which the goal is to assign a label to each member of the sequence.
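The country example can be pictured as one label per token. The sketch below assumes a BIO-style labelling scheme (`B-` begins an entity, `I-` continues it, `O` marks everything else) and hand-written labels. The scheme, the `COUNTRY` tag and the helper `extract_entities` are assumptions for illustration, and in practice the labels would come from a trained model rather than being typed in.

```python
def extract_entities(tokens, labels):
    """Collect (entity_text, entity_type) spans from BIO-style labels."""
    entities, current, current_type = [], [], None
    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A new entity starts; flush any entity still being built
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], label[2:]
        elif label.startswith("I-") and current and label[2:] == current_type:
            # Continuation of the entity currently being built
            current.append(token)
        else:
            # Outside any entity (or an inconsistent I- tag): flush and reset
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities


# Hand-labelled example sentence: one label per token
tokens = ["Officials", "from", "New", "Zealand", "visited", "Japan", "."]
labels = ["O", "O", "B-COUNTRY", "I-COUNTRY", "O", "B-COUNTRY", "O"]

entities = extract_entities(tokens, labels)
print(entities)
```

Counting how many articles mention each country then reduces to tallying the extracted entity strings per article.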
(This is an updated version of the article published on my previous personal website and quuxlab.) There is probably no need to say that there is too much information on the Web nowadays. Search engines help us a little bit. What is better is to have something interesting recommended to us automatically without asking. Indeed, from something as simple as a list of the most popular questions and answers on Quora to the more personalized recommendations we receive on Amazon, we are usually offered recommendations on the Web.
The IPython Notebook, now called Jupyter Notebook, is a convenient and interactive Web application for fast prototyping and testing ideas in Python (and R, Julia, Scala, and others) in the Web browser. Installing it on Ubuntu is easy, but it takes a little more effort to deploy it on a server and have it run as a service. This article serves as a simple guide to deploying Jupyter on an Ubuntu server, using the Nginx Web server and the supervisor system.
In the past, studying social issues such as the mobility of a group of people generally required a huge amount of effort. Questionnaires would have had to be prepared, distributed, and collected after they were filled in. It was and still is a labor-intensive task when face-to-face interviews are required to obtain various personal data. Nowadays, we have more and more people connected to the Internet, and many of these Internet users participate in various kinds of social interactions on the Web.