Exploring Data Sets


The goal of this project was to find data sets online and then create a Python helper file for loading them.
The helper file is a list of dictionaries. Each entry holds the URL of the data file, a short name, and the pandas read function that loads it.

This is what the file looked like:

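import pandas as pd  # the entries below reference pd.read_csv and pd.read_excel

# Despite the name, this is a list with one dictionary per data set.
myDictionary = [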
    {
        "URL" : 'http://users.stat.ufl.edu/~winner/data/ufo_location_shape.csv',
        "name" : "UFO",
        "load_function" : pd.read_csv
    },
    {
        "URL" : 'https://storage.googleapis.com/kagglesdsdata/datasets/2021/5514/cereal.csv?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20241026%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20241026T200323Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=9c2da66351987f3fac5078571e5e1af112e507f9f47a20b165f07dd08e7466ee7b31dde7cdd26383c940a6912e59b614841a7d02904ef4b688f35d1638c57fee86921e10a94ceca264e1fb3db92fa67fde73ae9587f7df4b0a32aa4c543c0b5ba377f40191f870d0e64dda52e3e826a146c2d8da5953668d7fb0bccc9f73a78ee6570c565848af002b2b4dbea15580db88200468fa36b0ae1551c624fc910b783efc8af0e5a4b1a8163eb2e93598d6116dc757f23af153ae11576c72e7626e80c383ad55567635c9b23646c414015631ffeca954490dd8513e2bd942913dac5b3ab8552d68b9836cb7e26ef91f8a77d6e8a858f2b97a87850f790528f503213c',
        "name" : "CEREAL",
        "load_function" : pd.read_csv
    },
    {
        "URL" : 'https://eazyml.com/documents/Coffee%20As%20A%20Stimulant%20-%20Training%20Data.xlsx',
        "name" : "COFFEE",
        "load_function" :  pd.read_excel
    }
]

import pandas as pd

import dataSets
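Looking an entry up by its position in the list works, but it is easy to grab the wrong data set if the order ever changes. As a minimal sketch, a small lookup function could find an entry by its "name" instead. The function name `load_dataset` is hypothetical and is not part of the original helper file; it only assumes each entry has the "URL", "name", and "load_function" keys shown above.

```python
def load_dataset(catalog, name):
    """Find the entry called `name` in `catalog` and load it with its own reader.

    `catalog` is assumed to be a list of dictionaries shaped like myDictionary.
    """
    for entry in catalog:
        if entry["name"] == name:
            # Call the stored read function (e.g. pd.read_csv) on the stored URL.
            return entry["load_function"](entry["URL"])
    raise KeyError(f"no dataset named {name!r} in catalog")
```

With the helper file above, `load_dataset(dataSets.myDictionary, "UFO")` would make the same call as `dataSets.myDictionary[0]["load_function"](dataSets.myDictionary[0]["URL"])`.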

UFO sightings and shape descriptions

This data set was collected to see which attributes and shapes have been associated with UFO sightings.

More info can be found here

Questions:

How was this data collected?
Has this data been used in any actual research?

UFO = dataSets.myDictionary[0]["load_function"](dataSets.myDictionary[0]["URL"])
UFO

	Event.Date	Shape	Location	State	Country	Source	USA	weekday
0	6/18/2016	Boomerang/V-Shaped	South Barrington	IL	USA	NUFORC	1	7
1	6/17/2016	Boomerang/V-Shaped	Kuna	ID	USA	NUFORC	1	6
2	5/30/2016	Boomerang/V-Shaped	Lake Stevens	WA	USA	NUFORC	1	2
3	5/27/2016	Boomerang/V-Shaped	Gerber	CA	USA	NUFORC	1	6
4	5/24/2016	Boomerang/V-Shaped	Camdenton	MO	USA	NUFORC	1	3
...	...	...	...	...	...	...	...	...
3641	11/2/2015	Unknown	Phnom Penh	NaN	Cambodia	NUFORC	0	2
3642	4/15/2015	Unknown	Hemel Hempstead	NaN	England/UK	NUFORC	0	4
3643	1/2/2005	Unknown	Manat	NaN	Puerto Rico	NUFORC	0	1
3644	5/4/1988	Unknown	Bounty (the ship)	NaN	NaN	NUFORC	0	4
3645	11/15/1978	Unknown	NaN	NaN	Tonga	NUFORC	0	4

3646 rows × 8 columns
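One way to start on the "which shapes" question is to count how often each shape appears and to turn Event.Date into real dates. The sketch below uses five rows copied from the output above so it runs without a download; the variable names are my own, and the same calls should work on the full UFO DataFrame, assuming every date follows the month/day/year pattern shown.

```python
import pandas as pd

# Five rows copied from the output above (not the full data set).
sample = pd.DataFrame({
    "Event.Date": ["6/18/2016", "6/17/2016", "5/30/2016", "11/2/2015", "1/2/2005"],
    "Shape": ["Boomerang/V-Shaped", "Boomerang/V-Shaped", "Boomerang/V-Shaped",
              "Unknown", "Unknown"],
    "Country": ["USA", "USA", "USA", "Cambodia", "Puerto Rico"],
})

# Event.Date is loaded as text; convert it to timestamps (month/day/year).
sample["Event.Date"] = pd.to_datetime(sample["Event.Date"], format="%m/%d/%Y")

# Count how often each shape was reported, most common first.
shape_counts = sample["Shape"].value_counts()

# Count sightings per calendar year, ordered by year.
per_year = sample["Event.Date"].dt.year.value_counts().sort_index()

print(shape_counts)
print(per_year)
```

On the full data, `UFO["Shape"].value_counts()` would give the count for every shape in one call.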

80 Cereals

This data set contains information on sugar, calories, health rating, and manufacturer for each cereal. It was meant to show how unhealthy cereals can be.

More can be found here

Questions:

The original data was collected by Petra Isenberg, Pierre Dragicevic and Yvonne Jansen. Was there any bias?
What do the healthiest cereals have in common?
Are the healthy cereals that far from the less healthy ones?

cereal = dataSets.myDictionary[1]["load_function"](dataSets.myDictionary[1]["URL"])
cereal

	name	mfr	type	calories	protein	fat	sodium	fiber	carbo	sugars	potass	vitamins	shelf	weight	cups	rating
0	100% Bran	N	C	70	4	1	130	10.0	5.0	6	280	25	3	1.0	0.33	68.402973
1	100% Natural Bran	Q	C	120	3	5	15	2.0	8.0	8	135	0	3	1.0	1.00	33.983679
2	All-Bran	K	C	70	4	1	260	9.0	7.0	5	320	25	3	1.0	0.33	59.425505
3	All-Bran with Extra Fiber	K	C	50	4	0	140	14.0	8.0	0	330	25	3	1.0	0.50	93.704912
4	Almond Delight	R	C	110	2	2	200	1.0	14.0	8	-1	25	3	1.0	0.75	34.384843
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
72	Triples	G	C	110	2	1	250	0.0	21.0	3	60	25	3	1.0	0.75	39.106174
73	Trix	G	C	110	1	1	140	0.0	13.0	12	25	25	2	1.0	1.00	27.753301
74	Wheat Chex	R	C	100	3	1	230	3.0	17.0	3	115	25	1	1.0	0.67	49.787445
75	Wheaties	G	C	100	3	1	200	3.0	17.0	3	110	25	1	1.0	1.00	51.592193
76	Wheaties Honey Gold	G	C	110	2	1	200	1.0	16.0	8	60	25	1	1.0	0.75	36.187559

77 rows × 16 columns
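A first pass at the "healthiest cereals" question could sort by rating and check how sugar relates to it. The sketch below uses five rows copied from the output above. It assumes that the -1 in the potass column (Almond Delight) is a placeholder for a missing measurement rather than a real value; that is my reading of the data, not something stated in the source.

```python
import pandas as pd

# Five rows copied from the output above (not the full data set).
sample = pd.DataFrame({
    "name": ["100% Bran", "All-Bran with Extra Fiber", "Almond Delight",
             "Trix", "Wheaties"],
    "fiber": [10.0, 14.0, 1.0, 0.0, 3.0],
    "sugars": [6, 0, 8, 12, 3],
    "potass": [280, 330, -1, 25, 110],
    "rating": [68.402973, 93.704912, 34.384843, 27.753301, 51.592193],
})

# Assumption: -1 means "not measured", so blank it out as NaN
# to keep it from skewing averages.
sample["potass"] = sample["potass"].mask(sample["potass"] == -1)

# Highest-rated cereals first.
ranked = sample.sort_values("rating", ascending=False)

# Pearson correlation between sugar content and rating.
sugar_vs_rating = sample["sugars"].corr(sample["rating"])

print(ranked[["name", "fiber", "sugars", "rating"]])
print(sugar_vs_rating)
```

Five rows are too few to conclude anything; running the same steps on the full cereal DataFrame would give a fairer picture.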

Coffee as a Stimulant

A simple data set to see if coffee consumption shows any relation to how fast a participant can type.

More can be found here

Questions:

Could the size of the person change these results?
Why was only coffee used, and not energy drinks or other “energy” substitutes?

coffee = dataSets.myDictionary[2]["load_function"](dataSets.myDictionary[2]["URL"])
coffee

	Cups of coffee consumed	Caffeinated or Decaffeinated	Coffee Brand	Time of the day	Typing Speed in characters per minute
0	2.0	Caffeinated	Folgers	Morning	260
1	1.0	Caffeinated	Folgers	Morning	205
2	1.0	Decaffeinated	Folgers	Morning	183
3	2.0	Caffeinated	Nescafe	Morning	247
4	1.0	Caffeinated	Nescafe	Morning	211
...	...	...	...	...	...
78	1.0	Decaffeinated	Himalayan	Evening	198
79	1.5	Decaffeinated	Folgers	Morning	185
80	1.5	Decaffeinated	Himalayan	Morning	191
81	1.5	Decaffeinated	Nescafe	Afternoon	187
82	1.5	Decaffeinated	Folgers	Evening	186

83 rows × 5 columns

coffee.describe()

	Cups of coffee consumed	Typing Speed in characters per minute
count	83.000000	83.000000
mean	2.246988	220.421687
std	1.213173	37.973393
min	1.000000	176.000000
25%	1.000000	190.000000
50%	2.000000	205.000000
75%	3.000000	257.500000
max	5.500000	291.000000
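describe() summarizes all participants together, so it cannot show whether caffeine makes a difference. A rough sketch of that comparison is to group by the Caffeinated or Decaffeinated column and average the typing speed. The ten rows below are copied from the output above, so the numbers describe only those rows, not the full 83.

```python
import pandas as pd

kind_col = "Caffeinated or Decaffeinated"
speed_col = "Typing Speed in characters per minute"

# The ten rows visible in the output above (indices 0-4 and 78-82).
sample = pd.DataFrame({
    "Cups of coffee consumed": [2.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.5, 1.5, 1.5, 1.5],
    kind_col: ["Caffeinated", "Caffeinated", "Decaffeinated", "Caffeinated",
               "Caffeinated", "Decaffeinated", "Decaffeinated", "Decaffeinated",
               "Decaffeinated", "Decaffeinated"],
    speed_col: [260, 205, 183, 247, 211, 198, 185, 191, 187, 186],
})

# Mean typing speed for each coffee type.
mean_speed = sample.groupby(kind_col)[speed_col].mean()

print(mean_speed)
```

On the full data, `coffee.groupby("Caffeinated or Decaffeinated")["Typing Speed in characters per minute"].mean()` would run the same comparison across all participants.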
