

Evaluation of Text Mining, Its Use for
Extraction of Effective and Efficient Data

Therisano A.
Reg. No: 14001299
Research Proposal
Department of Computer Science
and Information Systems,
Faculty of Science,
Botswana International
University of Science & Technology
E-mail: [email protected]
Contact Number: (+267) 74338126
 23 January



Text mining, which is also referred to as text data mining, is the process of
deriving high-quality information from natural language text. Once the
information has been derived, it is made available to data mining algorithms.
A lot can be done with text mining, for example analysing clusters of words
that occur within a document. The concept was first introduced in the late
1990s, when it emerged as “text data mining”.


Basic lexical analysis counts the frequencies of words and terms in order to
attempt to classify a document by topic. Text mining, or text data mining,
carries the analytical process a step further. Data mining looks for hidden,
complex patterns and relationships in datasets. Some of the techniques involved
include clustering, decision trees, classification, link analysis and many
more. These techniques can be used on data derived from textual sources, though
with adjustments to accommodate the high dimensionality of text-derived
information when every term is turned into an analytical dimension.
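The basic lexical analysis described above can be sketched in a few lines of
Python. This is a minimal illustration, not a method taken from the proposal:
the function names and the keyword lists in TOPICS are assumptions made up for
the example.

```python
import re
from collections import Counter

def term_frequencies(text):
    """Count how often each lowercase word occurs in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)

def guess_topic(text, topic_keywords):
    """Pick the topic whose keywords occur most often in the text."""
    counts = term_frequencies(text)
    scores = {
        topic: sum(counts[word] for word in keywords)
        for topic, keywords in topic_keywords.items()
    }
    return max(scores, key=scores.get)

# Hypothetical keyword lists, chosen only for this illustration.
TOPICS = {
    "sport": {"match", "goal", "team"},
    "finance": {"bank", "loan", "interest"},
}
```

Every distinct word becomes one key of the counter, which is also why the
number of dimensions grows quickly with the size of the vocabulary.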

Natural Language Processing
1.1 Introduction to the Research Problem
1.2 Research Background
1.3 Problem Statement
1.4 Research Objectives
1.4.1 General Objective
1.4.2 Specific Objectives
1.5 Research Questions
1.6 Justification of the Study
1.7 Proposal Structure
2.1 Introduction
2.X Conclusion
3.1 Introduction
3.2 Ethical and Philosophical Considerations
3.3 Research Design
3.4 Research Methods for Specific Objective 1
3.5 Research Methods for Specific Objective 2
3.6 Research Methods for Specific Objective 3
3.X Conclusion
References




NLP - Natural Language Processing


1.1 Introduction to the Research Problem


The size of data is increasing at a rapid rate each day. Businesses,
organisations and all types of institutions are storing their data
electronically. A huge amount of text is exchanged over the internet in the
form of repositories, digital libraries and other textual information such as
email, blogs and social media networks. This makes it a challenge to
determine appropriate patterns and trends and to extract valuable knowledge
from this large volume of data. [1]

Text mining is a process that extracts interesting and significant patterns in
order to explore knowledge from textual data sources. Text mining is a
multi-disciplinary field based on information retrieval, data mining, machine
learning, statistics, and computational linguistics. Text mining techniques are
continuously applied in industry, academia, web applications, the internet and
other fields. They are applied in areas such as search engines, email
filtering, fraud detection, product suggestion analysis, social media analysis,
feature extraction, and predictive and trend analysis. [2]

The process of text mining performs the following steps:

- Collection of unstructured data from different sources in their available
formats, which may include PDF, plain text and web pages.

- Cleansing and pre-processing to detect and remove anomalies. Cleansing makes
sure that the real essence of the available text is captured, and involves
removing stop words, stemming and indexing the data.

- Processing and controlling operations are applied to check and further clean
the data set by automatic processing.

- Pattern analysis is implemented, and this is done by a Management Information
System.

- Extraction of valuable and relevant information for effective and timely
decision making and trend analysis.

An appropriate technique for mining text reduces the time and effort needed to
find relevant patterns for analysis and decision making. [3]
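The cleansing step (stop-word removal, stemming and indexing) could look
roughly like the sketch below. It is an illustration under stated assumptions:
the stop-word list is deliberately tiny, and naive_stem is a crude
suffix-stripping stand-in for a real stemming algorithm, not an established
stemmer.

```python
import re

# A tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in"}

def naive_stem(word):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenise, drop stop words, and stem what remains."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

def build_index(documents):
    """Map each stemmed term to the ids of the documents that contain it."""
    index = {}
    for doc_id, text in enumerate(documents):
        for term in set(preprocess(text)):
            index.setdefault(term, []).append(doc_id)
    return index
```

For example, preprocess("Cleaning the collected documents") yields the terms
"clean", "collect" and "document", which build_index can then map back to the
documents they came from.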




1.2 Research Background


Text mining is used to describe the application of data mining techniques to
the automated discovery of useful or interesting knowledge from unstructured
text. Several techniques have been proposed for text mining, including
conceptual structure, association rule mining, episode rule mining, decision
trees, and rule induction methods. In addition, Information Retrieval
techniques have widely used the bag-of-words model for tasks such as document
matching, ranking, and clustering. [4]
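To make the bag-of-words model concrete, the sketch below represents each
document as a bag of word counts and ranks documents against a query. Cosine
similarity is used here as one common choice of matching score; that choice,
and all function names, are assumptions of this example rather than something
the cited work prescribes.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Represent the text as word counts, ignoring word order."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine_similarity(bag_a, bag_b):
    """Cosine of the angle between two count vectors (0.0 if either is empty)."""
    shared = set(bag_a) & set(bag_b)
    dot = sum(bag_a[t] * bag_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_documents(query, documents):
    """Return document indices ordered from most to least similar to the query."""
    query_bag = bag_of_words(query)
    scores = [cosine_similarity(query_bag, bag_of_words(d)) for d in documents]
    return sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
```

Because word order is discarded, two documents with the same words in a
different order receive the same representation.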


The task of information extraction is to find specific data in natural
language text. The data to be extracted is specified by a template, which lists
slots that are to be filled with substrings taken from the document. [1] For
example, a document in the job-posting domain can be paired with its filled
template. The template includes slots that are filled by strings taken directly
from the document. Several slots may have multiple fillers in the job-posting
domain, such as programming languages, platforms, applications, and areas.
Machine learning techniques have been developed to automatically construct
information extractors for job postings. [3]
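The idea of a template whose slots are filled with substrings of the document
can be illustrated as follows. Note the assumptions: this sketch uses a fixed,
hand-written vocabulary per slot rather than a learned extractor, and the slot
names and vocabulary entries are hypothetical.

```python
import re

# Hypothetical slot vocabularies for a job-posting template.
SLOT_VOCABULARY = {
    "languages": ["Python", "Java", "C++"],
    "platforms": ["Linux", "Windows"],
}

def fill_template(posting):
    """Fill each slot with the vocabulary strings that appear in the posting."""
    template = {}
    for slot, candidates in SLOT_VOCABULARY.items():
        template[slot] = [
            c for c in candidates
            # Match whole tokens only, so "Java" does not match "JavaScript".
            if re.search(r"(?<!\w)" + re.escape(c) + r"(?!\w)", posting)
        ]
    return template
```

A slot holds a list, so it can carry several fillers at once, as in a posting
that asks for more than one programming language.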


Text mining can be visualized as consisting of two phases: text refining,
followed by knowledge distillation. The text refining phase transforms
free-form text documents into a chosen intermediate form. Knowledge
distillation infers patterns or knowledge from the intermediate form. The
intermediate form can be semi-structured, such as a conceptual graph
representation, or structured, such as a relational data representation. [1]

1.3 Problem Statement


Many issues occur during the text mining process and affect the efficiency and
effectiveness of decision making. Text mining on large amounts of data is not
always effective and efficient; this depends on the techniques used. These
techniques include Information Extraction, Information Retrieval, Natural
Language Processing, Clustering and Text Summarization. [1]

1.4 Research Objectives


The objective of this paper is to analyse different text mining techniques
which help to perform text analytics effectively and efficiently on large
amounts of data. In addition, the issues that arise during the text mining
process are to be identified.


1.4.1 General Objective


To analyse the different text mining techniques

To analyse techniques for large amounts of data

To identify the efficient techniques

To observe the difficulty of text mining




1.4.2 Specific Objectives


- To come up with the effective and efficient techniques

- To select the techniques which are good for large data

- To see which technique takes a long time to complete its task

1.5 Research Questions


How efficient is it to apply text mining
techniques to analyse text?

How effective are the text mining techniques?


1.6 Justification of the Study


The main reason for this research is to see if text mining, under data mining,
has a beneficial purpose. The research goes in depth to see if text mining
benefits computer science, since analysis and patterns are important in the
world of computing. The research will seek to discover the efficient and
effective techniques by elaborating on each one thoroughly.

1.7 Proposal Structure


The whole process of text mining consists of a number of subordinate tasks. It
is easier to distribute the tasks into smaller groups in order to obtain a
good result from the process of analysis. First there is the stage of
information retrieval, which is characterized by the extraction of information
valuable for the analysis. It is followed by natural language processing, which
presents the retrieved text in natural human language. Next is the stage of
named entity recognition, which recognizes information according to certain
common identifiers. Lastly there are the more complicated sentiment analysis
and quantitative analysis, which analyse the data from all sides, involving the
psychological and other





2.1 Introduction


Text mining, also called text data mining, is defined as the process of
identifying or extracting information from large amounts of data [5], [1]. It
is characterized as a knowledge-intensive process in which users interact with
a document using analysis tools. According to StatSoft, the purpose of text
mining is the processing of unstructured information and the extraction of
meaningful numerical data from the text, which makes the information contained
in the text more accessible to various data mining techniques. [6] Using text
mining, one can derive summaries from the documents in a set and retrieve key
concepts for the whole set of documents. [7] Text mining is a combination of
techniques from areas such as natural language processing, information
retrieval, information extraction and data mining. [8] Moreover, each of those
techniques was developed long before the term text mining was formulated. The
following steps can be included in text mining. [5]

- Convert the unstructured text into structured data

- Identify the patterns from the structured data

- Analyse the patterns using text mining

- Extract the useful information from the text


The techniques in text mining come from different areas such as information
extraction, information retrieval, natural language processing (NLP),
categorization and clustering. [9] These stages of the text mining process can
be combined into a single workflow. In general, text mining turns text into
numbers, which can later be incorporated into other data analyses to reveal
interesting statistical results. [10]
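One way to see how text is turned into numbers is a TF-IDF weighting, sketched
below. TF-IDF is chosen here only as a familiar example and is not named in the
sources cited above. The variant shown is the plain, unsmoothed form (term
frequency times the logarithm of the inverse document frequency), and the
function names are assumptions of the example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Split the text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    token_lists = [tokenize(d) for d in documents]
    n_docs = len(token_lists)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    weights = []
    for tokens in token_lists:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights
```

A term that appears in every document receives a weight of zero, while a term
that is specific to one document receives a positive weight. The resulting
dictionaries can be passed on to clustering or classification routines.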

2.X Conclusion


The availability of large amounts of text-based data makes it necessary to
process this data to extract valuable information. Text mining techniques are
used to analyse the interesting and relevant information effectively and
efficiently from large amounts of unstructured data. Specific patterns and
sequences are applied in order to extract useful information by eliminating
irrelevant details for predictive analysis.




3.1 Introduction


There are many techniques developed to address the problem of text mining,
which is considered to be nothing more than information retrieval according to
the requirements of a user. Information retrieval uses four methods:

Taxonomy Method

Term Based

Based Method

Based Method

3.2 Ethical and Philosophical Considerations


Text mining obtains useful data from large amounts of data, which supports the
progress of industries, government institutions and researchers. Text mining is
therefore considered a very helpful technique. This research will not involve
human interaction.

3.3 Research Design


The first step is to go to the library and gather the journals. The next step
is to explore the relationship between two or more variables through a
correlational analysis.

3.4 Research Methods for Specific Objective 1


To come up with the effective and efficient techniques:


Analysis and experiment, since the efficiency needs to be observed.

3.5 Research Methods for Specific Objective 2


To select the techniques which are good for large data:


Studies and Experiment.


3.6 Research Methods for Specific Objective 3


To see which technique takes a long time to complete its task:

Observation, since the tasks will be running simultaneously and we will see
which one completes first.
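A simple way to record such observations is a small timing harness. The sketch
below is an assumption of this example in two respects: it times each technique
separately, one after the other, rather than running them simultaneously, and
the two functions count_words and count_characters are hypothetical stand-ins
for real text mining techniques.

```python
import time

def time_technique(technique, documents):
    """Return the wall-clock seconds the technique takes on the documents."""
    start = time.perf_counter()
    technique(documents)
    return time.perf_counter() - start

def compare_techniques(techniques, documents):
    """Time every technique and return (name, seconds) pairs, fastest first."""
    timings = {name: time_technique(fn, documents) for name, fn in techniques.items()}
    return sorted(timings.items(), key=lambda item: item[1])

# Two hypothetical stand-in techniques, used only to exercise the harness.
def count_words(documents):
    return sum(len(d.split()) for d in documents)

def count_characters(documents):
    return sum(len(d) for d in documents)
```

Timing each technique on the same documents, and repeating the measurement
several times, gives a fairer comparison than a single run.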








3.X Conclusion


In conclusion, with the rapid growth of digital data made available in recent
years, knowledge discovery and data mining have attracted great attention,
with a pressing need to process data into useful information and knowledge. [7]
As a result, there is growing research interest in the topic of text mining. In
general, text mining consists of analysing large amounts of text documents by
extracting key phrases, concepts and other useful data, and preparing the
processed text for further analysis with data mining techniques. We have
described the text mining processing flow, applications of text mining and
issues in text mining. The patterns generated facilitate decision making in
industries. [5] An overview of the concepts, applications, tools and issues of
text mining is presented to give researchers a basis for carrying the work to
the next level. Both qualitative and quantitative research will be practised
in this research.



Online. Available:




















Partnership for Sustainable Development Data, 2016. [Online]. Available:


“Sustainable Development Goals,” 2017. [Online]. Available:


Sustainable Development Agenda, “Sustainable Development Goals kick off
with start of new year,” 30 December 2015. [Online]. Available:


United Nations General Assembly, “Transforming our world: the 2030 Agenda for
Sustainable Development,” United Nations General Assembly, 2015.


United Nations General Assembly, “Report of the world commission on environment
and development: Our common future,” United Nations General Assembly, Oslo,