
Vector

A vector is an ordered finite list of numbers. Its entries are called elements of the vector. The dimension of the vector is the number of elements it contains.

Denoting an \(n\)-dimensional vector using the symbol \(\pmb{a}\), the \(i\)-th element of the vector \(\pmb{a}\) is denoted with \(a_i\), where the subscript \(i\) is an integer index that runs from 1 to \(n\).

A vector is said to be sparse if many of its elements are zero, i.e. if \(a_i=0\) for many \(i\).

Vector Space

A vector space is a collection of vectors, which may be added together and multiplied by numbers, called scalars.

Two vectors of the same size can be added together by adding the corresponding elements:

\[\pmb{a}+\pmb{b} = (a_1+b_1, ..., a_n+b_n)\]

A vector can be multiplied by a scalar, \(k\), by multiplying every element of the vector by the scalar.

\[k\cdot \pmb{a}=(ka_1, ..., ka_n)\]

The vector space can be extended with additional structures.

  • Inner product of two \(n\)-vectors:

\[\pmb{a}\cdot\pmb{b}=a_1b_1+...+a_nb_n\]

  • Euclidean norm of an \(n\)-vector:

\[||\pmb a||=\sqrt{a_1^2+...+a_n^2}\]

  • Euclidean distance between two \(n\)-vectors:

\[\mathrm{dist}(\pmb{a}, \pmb{b}) = ||\pmb{a} - \pmb{b}||\]

  • Angle between two \(n\)-vectors:

\[\theta = \arccos\Bigl(\frac{\pmb{a}\cdot\pmb{b}}{||\pmb{a}||\,||\pmb{b}||}\Bigr)\]
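
As a quick numerical check of these operations, here is a minimal sketch in Python (using NumPy, an assumption since these notes do not prescribe a library) that computes the inner product, norm, distance, and angle for two small 3-vectors.

import numpy as np

a = np.array([1.0, 2.0, 2.0])
b = np.array([3.0, 0.0, 4.0])

# Inner product: a1*b1 + ... + an*bn
inner = np.dot(a, b)

# Euclidean norm: sqrt(a1^2 + ... + an^2)
norm_a = np.linalg.norm(a)

# Euclidean distance: ||a - b||
dist_ab = np.linalg.norm(a - b)

# Angle: arccos( (a . b) / (||a|| ||b||) )
theta = np.arccos(inner / (np.linalg.norm(a) * np.linalg.norm(b)))

print(inner, norm_a, dist_ab, theta)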

Examples

Location: A 3-vector can represent the location or position of a point in 3-dimensional (3-D) space. The elements of the vector give the coordinates \((x,y,z)\).

Color: A 3-vector can represent a color, with its entries giving the Red, Green, and Blue (RGB) intensity.

Portfolio: An \(n\)-vector can represent a stock portfolio or investment in \(n\) different assets.

Time Series: An \(n\)-vector can represent a time series or signal, that is, the value of some quantity at different times.

Images: A black and white image with \(m \times n\) pixels can be represented by a vector of length \(m \cdot n\), with the elements giving the grayscale levels at the pixel locations, typically listed column-wise or row-wise.

Features: An \(n\)-vector can collect together \(n\) different quantities that pertain to a single object (e.g. age, height, weight, blood pressure, temperature, gender).

Text Data Vectorization

Processing natural language text and extracting useful information from it requires the text to be converted into a set of numerical features.

Word Embedding or Word Vectorization is a methodology in NLP for mapping text to a corresponding vector of real numbers, which can then be used by downstream automated text mining algorithms.

The process of converting text into numbers is called Vectorization.

Vector Space Model

Definitions

Terms are generic features that can be extracted from text documents. Typically terms are single words, keywords, n-grams, or longer phrases.

Documents are represented as vectors of terms. Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Several different ways of computing these values, also known as (term) weights, have been developed.

\[\pmb{d}=(w_1, ..., w_n)\]

The Corpus represents a collection of documents (the dataset). It is represented as a vector of documents, i.e. a matrix of term weights.

\[ \textbf{C} = \begin{pmatrix} \pmb d_1 \\ \pmb d_2 \\ \vdots \\ \pmb d_m \end{pmatrix} = \begin{pmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,n} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,n} \\ \vdots & \vdots & \ddots & \vdots \\ w_{m,1} & w_{m,2} & \cdots & w_{m,n} \end{pmatrix} \]

Each element \(C_{d,t}=w_{d,t}\) represents the weight of the \(t\)-th term in the \(d\)-th document.

The Vocabulary is the set of all unique terms in the corpus.

Remarks

  • The vocabulary corresponds to the canonical basis of the vector space.
  • The dimension of the space, \(n\), is the number of elements in the vocabulary.
  • Each document vector has exactly \(n\) elements, one for each term in the vocabulary. If a term does not occur in the document, its value in the vector is zero.
  • Vector operations can be used to compare documents.
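
To make the last remark concrete, the following minimal sketch compares two toy document vectors over an assumed 4-term vocabulary, using the inner product, norm, and angle introduced earlier; the vocabulary and counts are purely illustrative.

import numpy as np

# Toy vocabulary of n = 4 terms (an illustrative assumption)
vocabulary = ["cat", "dog", "fish", "bird"]

# Two documents as term-count vectors over that vocabulary
d1 = np.array([2, 1, 0, 0])   # "cat cat dog"
d2 = np.array([1, 0, 0, 1])   # "cat bird"

# Cosine similarity: inner product divided by the product of the norms
cos_sim = np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))

# Euclidean distance between the two document vectors
dist = np.linalg.norm(d1 - d2)

print(cos_sim, dist)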

Bag of Words (BOW)

With Bag of Words (BOW), we refer to a Vector Space Model where:

  • Terms: words (more generally we may use n-grams, etc.)
  • Weights: number of occurrences of the terms in the document.

from sklearn.feature_extraction.text import CountVectorizer

# A small example corpus (list of documents)
corpus = ["the cat sat on the mat", "the dog chased the cat"]

vectorizer = CountVectorizer(analyzer="word", ngram_range=(1, 1))

# Learn the vocabulary dictionary and return the term-document matrix.
X = vectorizer.fit_transform(corpus)

CountVectorizer Documentation

TF-IDF

With TF-IDF (Term Frequency-Inverse Document Frequency), we refer to a Vector Space Model where:

  • Terms: words, n-grams, etc.
  • Weights: higher weights are given to terms that are frequent in the document but rare in the corpus.

from sklearn.feature_extraction.text import TfidfVectorizer

# A small example corpus (list of documents)
corpus = ["the cat sat on the mat", "the dog chased the cat"]

vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))

# Learn the vocabulary dictionary and return the term-document matrix.
X = vectorizer.fit_transform(corpus)

TfidfVectorizer Documentation

Let \(n_{d,t}\) denote the number of times the \(t\)-th term appears in the \(d\)-th document.

\[TF_{d,t} = \frac{n_{d,t}}{\sum_i n_{d,i}}\]

Let \(N\) denote the total number of documents and \(N_{t}\) denote the number of documents containing the \(t\)-th term.

\[IDF_t = \log\Bigl(\frac{N}{N_t}\Bigr)\]

The TF-IDF weight is then:

\[w_{d,t} = TF_{d,t} \cdot IDF_t\]
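
The formulas above can be implemented directly; the following sketch computes the TF, IDF, and TF-IDF weights for an assumed toy count matrix. Note that scikit-learn's TfidfVectorizer uses a smoothed variant of IDF and L2-normalizes each document vector by default, so its numbers will differ slightly.

import numpy as np

# Toy term-count matrix C (2 documents x 3 terms), an illustrative assumption
C = np.array([[2, 1, 0],
              [1, 0, 3]], dtype=float)

# TF_{d,t} = n_{d,t} / sum_i n_{d,i}   (normalize each row)
TF = C / C.sum(axis=1, keepdims=True)

# IDF_t = log(N / N_t), with N documents and N_t documents containing term t
N = C.shape[0]
N_t = (C > 0).sum(axis=0)
IDF = np.log(N / N_t)

# w_{d,t} = TF_{d,t} * IDF_t
W = TF * IDF

print(W)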

Dimensionality Reduction

Feature Extraction

Feature extraction is intended to produce informative and non-redundant features, facilitating the subsequent learning steps. Typical operations include:

  • Stop-words removal
  • Stemming/lemmatization
  • Normalization
  • Removing rare terms

It is especially important in text mining due to the high dimensionality of text features and the existence of irrelevant (noisy) features.
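
As a rough sketch of the first two steps, assuming NLTK (which these notes do not otherwise introduce), stop-words can be removed and the remaining tokens stemmed before vectorization.

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords")  # one-time download of the stop-word lists

text = "The cats were chasing the dogs in the garden"
tokens = text.lower().split()

# Stop-words removal: drop very common, uninformative words
stop = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop]

# Stemming: reduce words to a common root form
stemmer = PorterStemmer()
tokens = [stemmer.stem(t) for t in tokens]

print(tokens)  # e.g. ['cat', 'chase', 'dog', 'garden']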

Feature Space Reduction

In Vector Space Models, a text will typically be a very sparsely populated vector living in a very high-dimensional space. It is often desirable to reduce the dimension of the feature space while retaining as much information as possible.

Given a \(d \times t\) matrix \(\textbf{C}\) with \(t\) large, it is often desirable to project the rows onto a lower-dimensional space, obtaining a matrix of shape \(d \times k\) with \(k \ll t\).

We would like this projection to keep the variance of the samples as large as possible, because this corresponds to losing as little information as possible.

Principal Component Analysis (PCA)

A standard method for feature space reduction is Principal Component Analysis, which projects a set of points onto a smaller dimensional affine subspace of “best fit”.

Singular Value Decomposition (SVD)

In general, if we want \(k\) features, it is optimal to project onto the span of the \(k\) eigenvectors of \(\textbf{C}^\intercal \textbf{C}\) with the largest eigenvalues.

These eigenvectors are the right singular vectors of \(\textbf{C}\); finding them, together with the corresponding singular values, is essentially the process known as singular value decomposition (SVD).

In PCA, we (usually) first perform some normalizations: we translate the columns to have mean 0 and scale them to have variance 1 before applying SVD. Geometrically, this amounts to centering the data and putting all features on the same scale.
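
A minimal sketch of this idea on text data, assuming a toy corpus: scikit-learn's TruncatedSVD works directly on the sparse term-document matrix (this variant, often used with TF-IDF features and known as LSA, skips the mean-centering step so that sparsity is preserved).

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumed toy corpus; any list of documents works
corpus = ["the cat sat on the mat",
          "the dog chased the cat",
          "dogs and cats are pets"]

# d x t term-document matrix (sparse, high-dimensional)
X = TfidfVectorizer(analyzer="word").fit_transform(corpus)

# Project the rows onto k dimensions (k must be smaller than the number of terms)
svd = TruncatedSVD(n_components=2)
X_reduced = svd.fit_transform(X)

print(X.shape, "->", X_reduced.shape)
print(svd.explained_variance_ratio_)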

Take Home Concepts

  • Processing natural language text requires the text to be converted into a set of numerical features.
  • Text data can be represented as vectors.
  • BOW and TF-IDF are models to vectorize text data.
  • Vectors representing text data are usually sparse and high-dimensional.
  • PCA reduces the dimension of the vector space while retaining as much information as possible.
