Section 1: Data Gathering
Data gathering is the process of extracting and measuring the necessary information (based on the variables of interest) in a well-ordered structure, which helps in answering relevant questions and evaluating results from the accumulated data.
The data for this coursework was gathered manually. Leading newspapers such as The Telegraph, The Wall Street Journal and The Times of India were used as sources of information about the latest trends and events around the globe.
The data set collected is stored in an Excel spreadsheet with the following columns:
News headline, description of the news headline, five comments (stored in five columns) and five tweets for the corresponding headline. The aim of the coursework is to analyse and predict emotions based on the opinions given by readers.
Manual collection was chosen for the following reasons:
To handpick the most trending topics globally.
To select comments that are relevant to the article. While collecting the dataset, it was observed that people tend to become abusive and drift off topic when giving opinions. At times it is not clear what a comment or tweet actually means.
Retrieving tweets and comments (from newspaper websites) on the same article was easier to do manually.
Manual collection can become tedious, but for the coursework-2 dataset, collecting five topics every day for 10 days was structured and manageable.
The data was collected manually, and Excel spreadsheets were used to store it in a well-structured manner. The first step was going through Twitter to find the most trending events and topics. Globally leading newspapers such as The Times of India, The Telegraph and The Wall Street Journal were then used as references to gather opinions and tweets. Millions of people follow these newspapers, hence it was convenient to get access to reader opinions.
For example, The Times of India published the headline “Robert Mugabe agrees to stand down as Zimbabwe president” on 19th November 2017. There were about 15 comments in which readers expressed their views. Twitter was also filled with tweets about the same story.
Around 50 data records have been collected in spreadsheet format. The spreadsheet contains the following columns:
Serial Number.
News Headlines.
Description of the headline (in about 4-5 sentences).
Comment Section:
Comment 1 for the headline.
Comment 2 for the headline.
Comment 3 for the headline.
Comment 4 for the headline.
Comment 5 for the headline.
Tweets for the corresponding headline:
Tweet 1 for the headline
Tweet 2 for the headline
Tweet 3 for the headline
Tweet 4 for the headline
Tweet 5 for the headline
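The column layout listed above can be sketched as a small data structure. This is only an illustrative sketch: the column names (serial_number, comment_1, tweet_1 and so on) and the helper parse_records are assumptions made for the example, and it assumes the spreadsheet has been exported to CSV with those headers. The actual spreadsheet headers may differ.

```python
import csv
import io
from dataclasses import dataclass
from typing import List

# Hypothetical column names: serial number, headline, description,
# five comment columns and five tweet columns (13 columns in total).
COLUMNS = (
    ["serial_number", "headline", "description"]
    + [f"comment_{i}" for i in range(1, 6)]
    + [f"tweet_{i}" for i in range(1, 6)]
)


@dataclass
class NewsRecord:
    serial_number: int
    headline: str
    description: str
    comments: List[str]
    tweets: List[str]


def parse_records(csv_text: str) -> List[NewsRecord]:
    """Parse CSV text (assumed to be exported from the spreadsheet) into records."""
    reader = csv.DictReader(io.StringIO(csv_text))
    records = []
    for row in reader:
        records.append(
            NewsRecord(
                serial_number=int(row["serial_number"]),
                headline=row["headline"],
                description=row["description"],
                comments=[row[f"comment_{i}"] for i in range(1, 6)],
                tweets=[row[f"tweet_{i}"] for i in range(1, 6)],
            )
        )
    return records
```

Grouping the five comments and five tweets into lists keeps each headline as one record while still matching the one-column-per-comment layout of the spreadsheet.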
Below is a screenshot from the spreadsheets demonstrating an example of a dataset.
In conclusion, the dataset includes fifty news headlines, the corresponding descriptions, five comments per headline (retrieved from the comment section of the newspaper's website) and five tweets on the same article as tweeted by the newspapers.
Section 2: Crowd-source data gathering
The term crowdsourcing was coined by Jeff Howe in Wired magazine in 2006.
“Crowd Sourcing is the act of taking a job traditionally performed by a designated employee and outsourcing it to an undefined, generally large group of people in the form of an open call” - Jeff Howe.
Wikipedia is one of the earliest examples of crowdsourcing.
Crowdsourcing is the process of connecting with large groups of people via the Internet to gain and share expertise, knowledge and resources. Using the crowd (a large group of people connected via an Internet platform) to help collect and organise information is called accessing distributed knowledge. Because of the high connectivity of the Internet, you can now reach a very large number of people quickly to gain and share opinions. This is where crowdsourcing comes into play.
To get successful results from crowdsourcing, it is best to target a managed, focused crowd. Crowdsourcing is a way of solving problems and producing things by connecting with people you would otherwise not know. Anyone can use crowdsourcing, be it governments, business organisations or individuals.
There are four different ways in which crowdsourcing works:
a. The first way gives you access to a large online labour force: you can identify and select workers yourself, or post the work on the platform and let the workers find and review it.
b. The second way allows you to ask the crowd to help you find the solution to a given task.
c. The third way is when you have the knowledge and resources, but you need help finding and organising them.
d. The fourth way is when you require ideas from the crowd. In return, they give you opinions based on the idea.
The crowd sourcing platform used here is “Amazon Mechanical Turk”.
Amazon Mechanical Turk is a crowdsourcing internet marketplace that enables requesters to coordinate with a human workforce and use human intelligence to carry out tasks that machines (computers) are unable to do.
To access Amazon Mechanical Turk, type www.mturk.com into the browser.
Below is the screenshot of the homepage of Amazon Mechanical Turk.
As you can see in the image above, there are two sections, namely Make Money and Get Results.
The “Make Money” section is for working on HITs (Human Intelligence Tasks). Workers work on the tasks posted by requesters and are paid for answering the query the requester has put forward.
The “Get Results” section is for requesters, who pay workers on Amazon Mechanical Turk to get answers to the tasks they have uploaded.
Human Intelligence Tasks (HITs): tasks uploaded on MTurk are those that are carried out by human intelligence and not by machine intelligence.
The requesters can specify:
Tasks to be carried out.
Keywords.
Deadline for the task to be completed.
Reward for the task completion.
Time Allotted for task to be completed.
Qualifications of the worker i.e., Location of the worker and Worker Approval Rating.
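The settings a requester can specify, as listed above, can be pictured as a single record per task. The sketch below is illustrative only: HitSpec and its field names are hypothetical and do not correspond to the MTurk interface or API, and the validation rules are assumptions chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HitSpec:
    """Hypothetical container for the settings a requester specifies."""
    title: str                    # task to be carried out
    keywords: List[str]           # keywords that help workers find the task
    lifetime_days: int            # deadline: how long the task stays available
    reward_usd: float             # reward per approved assignment
    time_allotted_minutes: int    # time a worker has once the task is accepted
    worker_locations: List[str] = field(default_factory=list)  # qualification
    min_approval_rating: int = 0  # qualification, as a percentage

    def validate(self) -> None:
        """Raise ValueError if a setting is obviously inconsistent."""
        if not self.title:
            raise ValueError("title must not be empty")
        if self.reward_usd <= 0:
            raise ValueError("reward must be positive")
        if self.lifetime_days <= 0 or self.time_allotted_minutes <= 0:
            raise ValueError("deadline and time allotted must be positive")
        if not 0 <= self.min_approval_rating <= 100:
            raise ValueError("approval rating is a percentage between 0 and 100")
```

Writing the settings down in one place before uploading makes it easier to check that the reward, deadline and qualifications are consistent across all tasks of a project.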
Working Strategy of Amazon Mechanical Turk (Requester)
The working strategy of Amazon Mechanical Turk is shown in the chart below.
BEGIN A PROJECT
Definition of the goals and key factors of the project.
FRAGMENTATION OF PROJECT INTO TASKS AND DESIGNING HITS
Once you are clear about the goals and key components of the project, break it into several tasks that can be uploaded to MTurk. The workers can then work in parallel to complete the project at a faster pace.
LOAD HITS
Millions of HITs can be loaded onto this platform. Multiple workers can be assigned to each HIT so that they provide different answers, which helps in comparing the results before drawing a conclusion.
ACCEPTANCE OF WORKERS ASSIGNMENT
In case the project requires workers to have special qualifications, these can be stated in the requirements before workers work on your HIT.
REVIEW OF THE WORK DONE BY THE WORKER
Review the work that the worker has done. If you accept that the work has been done thoroughly, you approve it; you pay only for work that is done and approved. Otherwise, you reject the work.
COMPLETED
The project has been completed and the workers have been paid for their work.
Interface for the MTurk workers to work for requesters for this coursework
Below are screenshots of the interface I provided to MTurk workers so that they could fill in the form and answer the questions about predicting emotions. This task was then assigned to 10 MTurk workers. I chose 10 workers because it helps me understand how 10 different people predict emotions based on just a comment, and what thinking process precedes their choice of a particular emotion.
Every worker has to be rewarded once the result they have sent is approved by the requester, so a reward amount is set before the task is given to the workers. Depending on the project requirements, a qualification requirement can also be set for workers on a particular task.
Below is the preview of the HTML code of the interface used by the workers to give in their inputs:
Honey-Pot Strategy
Lance Spitzner defines a honeypot as “an information system resource whose value lies in unauthorised or illicit use of that resource”.
A honeypot is a form of intrusion detection system (IDS) used to analyse an attacker's movements, which helps the system build a better defensive mechanism against such attacks. A honeypot is usually made up of a virtual machine that sits on the network.
It is basically a system set up to attract and trap people who try to gain unauthorised access to other systems.
There are three main goals of a honeypot system:
The virtual system should look real. It should be able to attract uninvited intruders to connect to it.
The virtual system should be monitored completely so that it is not used to launch a massive attack on the system.
The look and feel of the virtual system should match the actual system, with all the files and directories that will catch a hacker's attention.
Classification of Honey Pots
Honeypots can be classified by two criteria. This classification makes it easier to understand their operation and uses when planning a honeypot implementation in a network.
The classifications are as follows:
1. Implementation Environment
a. Production HoneyPots
These are honeypots used to secure an organisation's network in real production operating environments. A honeypot can be applied to the three layers of prevention, detection and response in a network.
b. Research HoneyPots
Such honeypots are not implemented with the aim of protecting networks. They are more of an educational resource for demonstrating and researching attack patterns.
2. Level of Interaction
a. Low Interaction Honeypots
Such honeypots are easy to install, configure, deploy and maintain, and are only used for specific attacks. There is no interaction with the underlying OS.
b. High Interaction Honeypots
Such honeypots control attackers at the network level. The honeypot developers are provided with good knowledge about the attacker's level and expertise.
The honeypot strategy for this coursework is to determine whether the workers chosen for crowdsourcing the dataset are genuine. To check how genuine the workers are, a system can be developed that presents a set of questions as part of the task. The questions should be structured so that they look like part of the task, yet help the task developer judge how genuine the worker is.
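One possible way to realise such a check is sketched below. It is an assumption, not the method used in this coursework: a few check questions with known answers are mixed into the task, and a worker is treated as genuine if enough of them are answered correctly. The question ids, the known answers and the 0.5 threshold are all hypothetical.

```python
from typing import Dict, List

# Hypothetical check questions with known answers, hidden among the real ones.
GOLD_ANSWERS = {"check_1": "Happy", "check_2": "Angry/Disgust"}


def gold_accuracy(worker_answers: Dict[str, str], gold: Dict[str, str]) -> float:
    """Fraction of check questions the worker answered correctly."""
    checked = [q for q in gold if q in worker_answers]
    if not checked:
        return 0.0  # a worker who skipped every check question is not trusted
    correct = sum(1 for q in checked if worker_answers[q] == gold[q])
    return correct / len(checked)


def genuine_workers(
    all_answers: Dict[str, Dict[str, str]],
    gold: Dict[str, str],
    threshold: float = 0.5,  # assumed cut-off, to be tuned by the requester
) -> List[str]:
    """Return the ids of workers whose check-question accuracy meets the threshold."""
    return sorted(
        worker
        for worker, answers in all_answers.items()
        if gold_accuracy(answers, gold) >= threshold
    )
```

Answers from workers who fall below the threshold could then be excluded before the ground truth is generated.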
Section 3: Ground Truth Generation
This section works with the results received from Amazon Mechanical Turk. It demonstrates what percentage of workers predicted each emotion after reading an opinion. The ground truth is then generated from the majority prediction. Since every comment and every tweet was assigned to 10 workers on MTurk, a variety of predicted emotions was received. This is shown in the example below as pie charts.
Example :
HEADLINE : Theresa May warned British public will 'go bananas' if she offers EU £40bn to settle Brexit divorce bill
Theresa May has been warned by a senior Tory MP that the British public will “go bananas” if she agrees to a Brexit divorce bill of £40billion or more.
Robert Halfon, the former deputy chairman of the Conservative Party, said voters would not accept such a payment at a time when public services are clamouring for more funding.
Comment Section:
Comment 1:It's pretty obvious.If she caves in on the Brexit ransom the Tories sill lose the next election. Bigly.
Comment 2: It's not a 'divorce' – it was never a 'marriage' – it was a treaty of convenience and cooperation and a business partnership. Stop emoting about it.
Comment 3 : David Davis should be very welcome to use RAF aircraft. It will be a poignant reminder to the EU of what the won't have come March next year if they don't learn to behave.
Comment 4 : Disgusted TM is gutless she is a traitor to the people and the Country she serves, EU is nothing without the UK 🇬🇧 for god sake women walk away…
Comment 5: The UK public is already going bananas at the blatant incompetence of Mrs. 'Maybe'.Offer a penny more and the public will escalate to 'ape caca' status
Tweets Section :
Tweet 1 : I don’t understand why UK is paying anything to leave EU. They have paid enough. And if they’re terming this a divorce, have EU pay alimony!
Tweet 2 : Tell the EU we won't pay a penny unless they negotiate and walk away if they refuse. Play hard ball it's better to walk away than accept a joke of a deal that we're likely to get.
Tweet 3 : 40bn not enough for divorce? wow this is greediness of the highest order
Tweet 4 : @theresa_may is out of her mind. she cant afford to give 1 pound to eu she cant even afford to pay people proper benefit money or pensions.
Tweet 5 : depends what she gets in return surely? 40bn seems reasonable if there's continued access to single market
The emotions that the workers could choose from are:
Happy
Angry/Disgust
Fear/Surprise
Sadness
Neutral
An Interface was designed for the workers to predict the emotions on MTurk.
Amazon Mechanical Turk Results for the above headline
[Pie charts of the MTurk workers' predicted emotions for Comment 1 to Comment 5 and Tweet 1 to Tweet 5]
As per the pie charts, the most predicted emotions are Angry/Disgust, Sadness and Fear/Surprise.
In conclusion, the emotions predicted for each comment and tweet are as follows:
Comment 1 : Angry/Disgust
Comment 2 : Angry/Disgust
Comment 3: Angry/Disgust
Comment 4: Angry/Disgust
Comment 5: Angry/Disgust
Tweet 1 : Angry/Disgust
Tweet 2 : Angry/Disgust
Tweet 3 : Angry/Disgust
Tweet 4 : Angry/Disgust
Tweet 5 : Angry/Disgust
Therefore, the above example shows that the opinions given by people on the news headline “Theresa May warned British public will 'go bananas' if she offers EU £40bn to settle Brexit divorce bill” are predicted to be Angry/Disgust.
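The majority rule used above to generate the ground truth can be sketched as follows. This is a minimal sketch: the function names are hypothetical, and the handling of ties (the label that appears first among the tied ones wins) is an assumption, since the coursework does not state how ties are resolved. Any votes passed in are for illustration and are not the actual MTurk results.

```python
from collections import Counter
from typing import Dict, List, Tuple


def vote_percentages(votes: List[str]) -> Dict[str, float]:
    """Percentage of workers who chose each emotion (the pie chart slices)."""
    counts = Counter(votes)
    total = len(votes)
    return {label: 100 * n / total for label, n in counts.items()}


def majority_label(votes: List[str]) -> Tuple[str, float]:
    """Ground-truth label by majority vote, with the share of workers who chose it.

    On a tie, the label that appears first in the list of votes wins
    (an assumed rule, not one taken from the coursework).
    """
    if not votes:
        raise ValueError("at least one vote is required")
    label, n = Counter(votes).most_common(1)[0]
    return label, n / len(votes)
```

With 10 workers per comment or tweet, the share returned by majority_label also indicates how strongly the workers agreed on the ground-truth emotion.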
Section 4 : Critical Discussion
The goals of the coursework on “Emotion Data – Ground Truth Generation and Analysis” are as follows:
Creating a dataset by collecting news headlines, description of the corresponding news headlines, five comments and five tweets for the same article.
Once the data set has been created, it is given out for crowdsourcing. Each headline, with all its comments and tweets, is submitted to a crowdsourcing platform. For this coursework, Amazon Mechanical Turk was chosen. An interface is created with questionnaires for the workers to answer.
The tasks were then assigned to 10 MTurk workers for diverse prediction of emotions.
Based on the results received from MTurk, ground truth has to be generated for emotion analysis of each opinion.
Data was gathered by manual collection. Fifty records were compiled, with five comments and five tweets for each headline. Manual collection helped me filter out, from the start, which comments were relevant and which were not. This could be time-consuming if time is not managed properly. The fifty records were collected over a span of 10 days, i.e., 5 topics per day, so time management was an important factor.
The data was collected and stored in a spreadsheet containing 13 columns and 50 rows.
Once the data set was collected, it was divided into tasks to be uploaded onto the crowdsourcing platform Amazon Mechanical Turk. An interface was created with all the questions that needed to be answered for emotion data analysis. Crowdsourcing is very helpful in obtaining answers to the questions that requesters want answered.
Advantages of crowd-sourcing for the course-work
1. Connecting with people across the world to seek answers and opinions on your ideas, suggestions and queries.
2. Gaining a global perspective on how a sentence or opinion could be interpreted.
3. Staying anonymous when putting forward an idea you are not sure will work: the idea can be uploaded on a crowdsourcing platform to seek suggestions.
But there are disadvantages to crowdsourcing as well:
1. The quality of the answers could vary from worker to worker.
2. Costs could increase if you need more than five workers' opinions on a single task. Projects are divided into tasks, so collectively the cost rises. The cost may also vary from platform to platform.
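How the cost adds up can be illustrated with simple arithmetic. The counts of 50 headlines, 10 texts per headline (five comments and five tweets) and 10 workers per text come from this coursework; the reward per judgement is a hypothetical figure supplied by the caller, and platform fees are not included.

```python
from typing import Tuple


def total_cost_cents(
    headlines: int,
    texts_per_headline: int,
    workers_per_text: int,
    reward_cents: int,  # hypothetical reward per judgement, in cents
) -> Tuple[int, int]:
    """Return (number of judgements, total reward in cents)."""
    judgements = headlines * texts_per_headline * workers_per_text
    return judgements, judgements * reward_cents
```

For example, calling total_cost_cents(50, 10, 10, reward_cents) gives 5,000 judgements for this dataset, so even a small reward per judgement is multiplied 5,000 times.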
To check the genuineness of the workers who answer the tasks you have submitted, a honeypot has to be implemented. A honeypot will help with spam detection.
The downside of using honeypots, however, is the number of resources used. If any of the resources that collectively form the base module is compromised, the security of the base module goes down. Honeypots also have limited vision, i.e., they can only observe activity directly connected to the honeypot interface and not that of neighbouring systems.
If the quality of the answers given through the crowdsourcing platform is not up to the mark, the ground truth generation could be affected. Hence, predicting the emotions of the text would be difficult.
One alternative approach is to use a web crawler instead of manual data gathering. A web crawler would help obtain data sets faster than the method chosen for this coursework.
Features of Web Crawler :
Easier Browsability.
Reduced traffic on the network, which will in turn help in attaining search-directed access.
Multi-functional robots (spiders) can help attain faster results even when run simultaneously.
For effective crowd-sourcing results :
1. Communicate clearly what your aim is and what has to be achieved. The workers should get a clear understanding of what has to be done.
2. If you require the project work to be done at an expert level, you can also set qualification requirements for workers.
3. Be brief while explaining your objectives.
4. If you find that the task done by a worker is not up to your expectations, the task can be rejected. It is necessary for a requester not to be fully dependent on the answers given; the task results have to be reviewed thoroughly by the requester.
5. Choose the right keywords, so that the tasks are seen by workers and completed on time.
IMPORTANT NOTE ON REFERENCE:
Kindly note that the data set was collected from the following sources:
1. The Times of India
2. The Telegraph
3. The Wall Street Journal
The data set includes the following and is stored in the spreadsheet attached in the zip folder.
News headlines
Description of the news headline
Five Comments and Tweets