
 

 
 
 
 
 
 
Evaluation of Text Mining, Its Use for
Extraction of Effective and Efficient Data
 
By
 

Therisano A.
Motsilenyane
 
 
Reg. No: 14001299
 
 
 
Research Proposal
 
 
Department of Computer Science
and Information Systems,
Faculty of Science,
Botswana International
University of Science & Technology
E-mail: [email protected]
Contact Number: (+267) 74338126
 
 
 
 
 
 
 
 
 
 23 January
2018
 

ABSTRACT

 

Text mining, also referred to as text data mining, is the process of deriving
high-quality information from natural language text. Once this information has
been derived, it is made available to data mining algorithms. Text mining
supports many kinds of analysis, for example analysing clusters of words within
a document. The concept was first introduced in the late 1990s under the name
“text data mining”.

 

Basic lexical analysis counts the frequencies of words and terms in an attempt
to classify a document by topic. Text mining carries this analytical process a
step further. Data mining looks for hidden, complex patterns and relationships
in datasets. The techniques involved include clustering, decision trees,
classification, link analysis and many more. These techniques can be applied to
data derived from textual sources, though with adjustments to accommodate the
high dimensionality of text-derived information when every term is turned into
a separate analytical dimension.

TABLE OF CONTENTS
ABSTRACT
TABLE OF CONTENTS
LIST OF ABBREVIATIONS
SECTION ONE: INTRODUCTION
1.1 Introduction to the Research Problem
1.2 Research Background
1.3 Problem Statement
1.4 Research Objectives
1.4.1 General Objective
1.4.2 Specific Objectives
1.5 Research Questions
1.6 Justification of the Study
1.7 Proposal Structure
SECTION TWO: LITERATURE REVIEW
2.1 Introduction
2.X Conclusion
SECTION THREE: METHODOLOGY
3.1 Introduction
3.2 Ethical and Philosophical Considerations
3.3 Research Design
3.4 Research Methods for Specific Objective 1
3.5 Research Methods for Specific Objective 2
3.6 Research Methods for Specific Objective 3
3.X Conclusion
References
APPENDICES
                                                                                                            

 

LIST OF ABBREVIATIONS

 

NLP- Natural Language Processing

SECTION ONE: INTRODUCTION

1.1 Introduction to the Research Problem

 

The volume of data is growing rapidly every day. Businesses, organisations and
institutions of all types store their data electronically. A huge amount of
text is exchanged over the internet in the form of repositories, digital
libraries and other textual information such as email, blogs and social media
networks. This makes it a challenge to determine appropriate patterns and
trends and to extract valuable knowledge from such a large volume of data [1].

Text mining is the process of extracting interesting and significant patterns
in order to explore knowledge from textual data sources. It is a
multi-disciplinary field based on information retrieval, data mining, machine
learning, statistics and computational linguistics. Text mining techniques are
applied continuously in industry, academia, web applications, the internet and
other fields, in areas such as search engines, email filtering, fraud
detection, product suggestion analysis, social media, feature extraction, and
predictive and trend analysis [2].

The text mining process consists of the following steps:

·  Collection of unstructured data from different sources in their available
formats, which may include PDF, plain text and web pages.

·  Cleansing and pre-processing to detect and remove anomalies. Cleansing
captures the real essence of the available text and involves removing stop
words, stemming, and indexing the data.

·  Processing and controlling operations, applied to check and further clean
the data set by automatic processing.

·  Pattern analysis, carried out by a Management Information System.

·  Extraction of valuable and relevant information for effective and timely
decision making and trend analysis.

An appropriate text mining technique reduces the time and effort needed to find
relevant patterns for analysis and decision making [3].
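
The cleansing and indexing steps listed above can be illustrated with a minimal Python sketch. It is not taken from the cited sources: the stop-word list, the suffix-stripping rule and the function names are assumptions chosen for illustration, and a real study would more likely use an established stemmer and a much longer stop-word list.

```python
import re
from collections import defaultdict

# Assumption: a tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to"}


def naive_stem(word):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


def build_index(documents):
    """Map each stemmed term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical two-document collection used only to exercise the functions.
documents = {
    "d1": "Mining patterns in emails",
    "d2": "The pattern of fraud is detected",
}
index = build_index(documents)
```

Because "patterns" and "pattern" reduce to the same stem, both documents end up under one index entry, which is the effect that stemming is meant to achieve before pattern analysis.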

 

 

 

1.2 Research Background

 

Text mining describes the application of data mining techniques to the
automated discovery of useful or interesting knowledge from unstructured text.
Several techniques have been proposed for text mining, including conceptual
structure, association rule mining, episode rule mining, decision trees and
rule induction methods. In addition, information retrieval techniques have
widely used the bag-of-words model for tasks such as document matching, ranking
and clustering [4].
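
As an illustration of the bag-of-words model mentioned above, the following Python sketch represents each text as a table of word counts and ranks documents against a query by cosine similarity. Cosine similarity is one common choice of matching score and is an assumption here, as are the function names; the cited work may use a different measure.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word occurrences; word order is discarded."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(bag_a, bag_b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(count * bag_b[term] for term, count in bag_a.items())
    norm_a = math.sqrt(sum(c * c for c in bag_a.values()))
    norm_b = math.sqrt(sum(c * c for c in bag_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text matches nothing
    return dot / (norm_a * norm_b)


def rank(query, documents):
    """Return document ids ordered from most to least similar to the query."""
    query_bag = bag_of_words(query)
    scores = {
        doc_id: cosine_similarity(query_bag, bag_of_words(text))
        for doc_id, text in documents.items()
    }
    return sorted(scores, key=scores.get, reverse=True)
```

Calling `rank("text mining", documents)` on a dictionary of document texts returns the ids with the best-matching document first. Two texts containing the same words in a different order receive a similarity of 1, which shows both the simplicity and the main limitation of the model.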

 

The task of information extraction is to find specific data in natural language
text. The data to be extracted is specified by a template, which lists slots to
be filled with substrings taken from the document [1]. For example, a job
posting can be paired with its filled template in an information extraction
task for the job-posting domain. Such a template includes slots that are filled
by strings taken directly from the document. Several slots may have multiple
fillers in the job-posting domain, such as programming languages, platforms,
applications and areas. Machine learning techniques have been developed to
construct information extractors for job postings automatically [3].
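
The idea of a template whose slots are filled with substrings of the document can be sketched as follows. This is a hand-written dictionary lookup, not the learned extractors described in the cited work, and the slot names and vocabularies are hypothetical values chosen for the illustration.

```python
import re

# Assumption: hypothetical slot vocabularies chosen only for this illustration.
SLOT_VOCABULARY = {
    "languages": ["Python", "Java", "C++"],
    "platforms": ["Linux", "Windows"],
    "applications": ["Oracle", "SQL Server"],
}


def fill_template(posting):
    """Fill each slot with every vocabulary string found verbatim in the posting."""
    template = {}
    for slot, candidates in SLOT_VOCABULARY.items():
        fillers = []
        for candidate in candidates:
            # Match the candidate as a whole token, ignoring letter case,
            # so that "Java" does not match inside "JavaScript".
            pattern = r"(?<![\w+])" + re.escape(candidate) + r"(?![\w+])"
            match = re.search(pattern, posting, flags=re.IGNORECASE)
            if match:
                fillers.append(match.group(0))  # substring taken from the document
        template[slot] = fillers
    return template
```

A slot such as `languages` can receive several fillers from one posting, while a slot with no match is left as an empty list, mirroring the multiple-filler behaviour described for the job-posting domain.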

 

Text mining can be visualized as consisting of two phases: text refining and
knowledge distillation. The text refining phase transforms free-form text
documents into a chosen intermediate form. Knowledge distillation infers
patterns or knowledge from the intermediate form. The intermediate form can be
semi-structured, such as a conceptual graph representation, or structured, such
as a relational data representation [1].
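
A minimal Python sketch of the two phases is given below, assuming a structured (relational) intermediate form made of `(doc_id, term, count)` rows. The row layout, the function names and the "terms shared by several documents" pattern are illustrative assumptions rather than details taken from the cited source.

```python
import re
from collections import Counter


def refine(documents):
    """Text refining: turn free-form text into relational rows (doc_id, term, count)."""
    rows = []
    for doc_id, text in documents.items():
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        for term, count in sorted(counts.items()):
            rows.append((doc_id, term, count))
    return rows


def distill(rows, min_documents=2):
    """Knowledge distillation: report terms that occur in at least min_documents documents."""
    # Each (doc_id, term) pair appears once in rows, so counting rows per term
    # gives the number of documents that contain the term.
    document_frequency = Counter(term for _doc_id, term, _count in rows)
    return sorted(t for t, df in document_frequency.items() if df >= min_documents)
```

The distillation step never looks at the original text, only at the intermediate rows, which is the separation of concerns that the two-phase view describes.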

1.3 Problem Statement

 

Many issues arise during the text mining process and affect the efficiency and
effectiveness of decision making. Depending on the techniques used, text mining
on large amounts of data may be neither effective nor efficient. These
techniques include information extraction, information retrieval, natural
language processing, clustering and text summarization [1].

1.4 Research Objectives

 

The objective of this research is to analyse different text mining techniques
that help to perform text analytics effectively and efficiently on large
amounts of data. In addition, the issues that arise during the text mining
process are identified.

 

1.4.1 General Objective

 

·  To analyse the different text mining techniques

·  To analyse techniques for large amounts of data

·  To identify the most efficient techniques

·  To observe the difficulties of text mining

 

 

 

1.4.2 Specific Objectives

 

·  To identify effective and efficient text mining techniques

·  To select the techniques that are suitable for large data sets

·  To determine which techniques take the longest to complete their tasks

1.5 Research Questions

 

o  How efficient is it to apply text mining techniques to analyse text?

o  How effective are the text mining techniques?

 

1.6 Justification of the Study

 

The main purpose of this research is to determine whether text mining, as a
branch of data mining, serves a beneficial purpose. The research examines in
depth whether text mining benefits computer science, since analysis and pattern
discovery are important in computing. It aims to identify the efficient and
effective techniques by examining each one thoroughly.

1.7 Proposal Structure

 

The overall text mining process consists of a number of subordinate tasks. It
is easier to divide these tasks into smaller groups in order to obtain good
results from the analysis. The first stage is information retrieval, which
extracts the information that is valuable for the analysis. It is followed by
natural language processing, which presents the retrieved text in natural human
language. Next is named entity recognition, which recognizes information
according to certain common identifiers. Finally, there are the more complex
sentiment analysis and quantitative analysis, which analyse the data from all
sides, including psychological and other aspects.

 

 

SECTION TWO: LITERATURE REVIEW

 

2.1 Introduction

 

Text mining, also called text data mining, is defined as the process of
identifying or extracting information from large amounts of data [5], [1]. It
is characterized as a knowledge-intensive process in which users interact with
a document using analysis tools. According to StatSoft, the purpose of text
mining is to process unstructured information and extract meaningful numerical
data from the text, which makes the information contained in the text more
accessible to various data mining techniques [6]. Using text mining, one can
derive summaries from the documents in a set and retrieve key concepts for the
whole set of documents [7]. Text mining combines techniques from areas such as
natural language processing, information retrieval, information extraction and
data mining [8]. Moreover, each of those techniques was developed long before
the term text mining was coined. Text mining can include the following
steps [5]:

·  Convert the unstructured text into structured data

·  Identify patterns in the structured data

·  Analyse the patterns using text mining techniques

·  Extract the useful information from the text

 

Text mining techniques come from different areas such as information
extraction, information retrieval, natural language processing (NLP),
categorization and clustering [9]. These stages of the text mining process can
be combined into a single workflow. In general, text mining turns text into
numbers, which can later be incorporated into other data analyses to reveal
interesting statistical results [10].
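
One widely used way of turning text into numbers is TF-IDF weighting, sketched below in Python. It is offered as an example of the general idea, not as the method of the cited sources; the exact formula (term frequency multiplied by log(N / df)) and the function name are assumptions, and several variants of the formula exist.

```python
import math
import re
from collections import Counter


def tf_idf(documents):
    """Return {doc_id: {term: weight}} using tf * log(N / df)."""
    tokenized = {
        doc_id: re.findall(r"[a-z]+", text.lower())
        for doc_id, text in documents.items()
    }
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for tokens in tokenized.values() for term in set(tokens))
    weights = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        total = len(tokens)
        weights[doc_id] = {
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
    return weights
```

A term that appears in every document receives a weight of zero, while a term concentrated in a few documents receives a higher weight, so the resulting numeric vectors can be passed to clustering or classification algorithms.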

2.X Conclusion

 

The availability of large amounts of text-based data creates a need to process
it in order to extract valuable information. Text mining techniques are used to
analyse interesting and relevant information effectively and efficiently from
large amounts of unstructured data. Specific patterns and sequences are applied
to extract useful information, eliminating irrelevant details, for predictive
analysis.

 

SECTION THREE: METHODOLOGY

 

3.1 Introduction

 

Many techniques have been developed to address the problem of text mining,
which can be regarded as nothing more than information retrieval according to
the requirements of a user. Information retrieval uses four methods:

i.    Pattern Taxonomy Method

ii.   Term Based Method

iii.  Concept Based Method

iv.   Phrase Based Method

3.2 Ethical and Philosophical Considerations

 

Text mining obtains useful information from large amounts of data, which
supports the progress of industries, government institutions and research. It
is therefore regarded as a very helpful technique. This research does not
involve human interaction.

3.3 Research Design

 

The first step is to set aside time to visit the library and gather the
relevant journals. The next step is to explore the relationship between two or
more variables through a correlational analysis.
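
A correlational analysis between two variables can be sketched with the Pearson correlation coefficient, as in the Python example below. The proposal does not name the variables or the coefficient to be used, so both the choice of Pearson's r and any variables passed to the function (for example, hypothetical measurements of data size and running time) are assumptions.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

The result lies between -1 and 1: values near 1 indicate that the two variables rise together, values near -1 that one falls as the other rises, and values near 0 that there is no linear relationship.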

3.4 Research Methods for Specific Objective 1

 

To identify effective and efficient text mining techniques:

 

Content analysis and experiment, since efficiency has to be observed in
practice.

3.5 Research Methods for Specific Objective 2

 

To select the techniques that are suitable for large data sets:

 

Case studies and experiment.

 

3.6 Research Methods for Specific Objective 3

 

To determine which techniques take the longest to complete their tasks:

Observation, since the tasks will run simultaneously and the order in which
they complete can be observed.
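
A possible way to record completion times is sketched below in Python. Unlike the simultaneous runs described above, this sketch times each technique separately, one after another, on the assumption that concurrent runs would compete for the processor and distort the measurements. The two techniques shown are hypothetical stand-ins used only to exercise the timer.

```python
import re
import time
from collections import Counter


def time_technique(technique, documents, repeats=5):
    """Return the best wall-clock time, in seconds, over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        technique(documents)
        best = min(best, time.perf_counter() - start)
    return best


# Two hypothetical stand-in techniques, used only to exercise the timer.
def count_terms(documents):
    return Counter(w for d in documents for w in re.findall(r"[a-z]+", d.lower()))


def longest_document(documents):
    return max(documents, key=len)


documents = ["text mining extracts patterns from text"] * 1000
timings = {
    technique.__name__: time_technique(technique, documents)
    for technique in (count_terms, longest_document)
}
slowest = max(timings, key=timings.get)
```

Taking the best of several repeats reduces the influence of background activity on the machine, and `slowest` names the technique that took longest on this input.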

 

 

 

 

 

 

 

3.X Conclusion

 

In conclusion, with the rapid growth of digital data in recent years, knowledge
discovery and data mining have attracted great attention, driven by the
pressing need to turn data into useful information and knowledge [7]. As a
result, there is growing research interest in text mining. In general, text
mining consists of analysing large numbers of text documents by extracting key
phrases, concepts and other useful data, and preparing the processed text for
further analysis with data mining techniques. This proposal has outlined the
text mining processing flow, applications of text mining and issues in text
mining. The patterns generated facilitate decision making in industry [5]. The
overview of concepts, applications, tools and issues of text mining is
presented to enable researchers to take the topic to the next level. Both
qualitative and quantitative research methods will be used in this research.

References

[1] [Online]. Available:
https://thesai.org/Downloads/Volume7No11/Paper_53-Text_Mining_Techniques_Applications_and_Issues.pdf.

[2] [Online]. Available: https://thesai.org/Downloads/Volume7No11/Paper_53-.

[3] [Online]. Available:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.403.2426&rep=rep1&type=pdf.

[4] [Online]. Available:
http://www.cs.utexas.edu/~ml/papers/discotex-melm-03.pdf.

[5] [Online]. Available: http://www.b-eye-network.com/view/6311.

[6] [Online]. Available:
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0156031.

[7] [Online]. Available:
https://is.vsh.cz/th/12446/vsh_b/Thesis_Varfolomeeva.pdf.

[8] [Online]. Available:
https://paginas.fe.up.pt/~prodei/dsie15/web/papers/dsie15_submission_10.pdf.

[9] “Text_Mining_Techniques_Applications_and_Issues.pdf”.

[10] [Online]. Available:
http://www.cs.utexas.edu/~ml/papers/discotex-melm-03.pdf.

[11] Global Partnership for Sustainable Development Data, 2016. [Online].
Available: http://www.data4sdgs.org/.

[12] SEED, “Sustainable Development Goals,” 2017. [Online]. Available:
https://www.seed.uno/about/work/sustainable-development-goals.html.

[13] Secretary-General Sustainable Development Agenda, “Sustainable
Development Goals kick off with start of new year,” 30 December 2015. [Online].
Available:
http://www.un.org/sustainabledevelopment/blog/2015/12/sustainable-development-goals-kick-off-with-start-of-new-year/.

[14] United Nations General Assembly, “Transforming our world: the 2030 Agenda
for Sustainable Development,” United Nations General Assembly, 2015.

[15] United Nations General Assembly, “Report of the world commission on
environment and development: Our common future,” United Nations General
Assembly, Oslo, 1987.

 
 

 

 

 

APPENDICES