Exploring Data Sets#

  • The goal of this project was to find data sets online and then create a Python helper file for loading them.

  • The helper file is a list of dictionaries, one per data set. Each entry holds the URL of the file, a short name, and the pandas read function that loads it.

This is what the helper file (imported below as dataSets) looked like:

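import pandas as pd  # needed for the pd.read_* references below

# One entry per data set: where the file lives, a short name, and the pandas reader to use.
# Despite its name, myDictionary is a list of dictionaries.
myDictionary = [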
    {
        "URL" : 'http://users.stat.ufl.edu/~winner/data/ufo_location_shape.csv',
        "name" : "UFO",
        "load_function" : pd.read_csv
    },
    {
        "URL" : 'https://storage.googleapis.com/kagglesdsdata/datasets/2021/5514/cereal.csv?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20241026%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20241026T200323Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=9c2da66351987f3fac5078571e5e1af112e507f9f47a20b165f07dd08e7466ee7b31dde7cdd26383c940a6912e59b614841a7d02904ef4b688f35d1638c57fee86921e10a94ceca264e1fb3db92fa67fde73ae9587f7df4b0a32aa4c543c0b5ba377f40191f870d0e64dda52e3e826a146c2d8da5953668d7fb0bccc9f73a78ee6570c565848af002b2b4dbea15580db88200468fa36b0ae1551c624fc910b783efc8af0e5a4b1a8163eb2e93598d6116dc757f23af153ae11576c72e7626e80c383ad55567635c9b23646c414015631ffeca954490dd8513e2bd942913dac5b3ab8552d68b9836cb7e26ef91f8a77d6e8a858f2b97a87850f790528f503213c',
        "name" : "CEREAL",
        "load_function" : pd.read_csv
    },
    {
        "URL" : 'https://eazyml.com/documents/Coffee%20As%20A%20Stimulant%20-%20Training%20Data.xlsx',
        "name" : "COFFEE",
        "load_function" :  pd.read_excel
    }
]
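One caveat about the CEREAL entry: its link looks like a signed Google Cloud Storage URL, where `X-Goog-Date` is the signing time and `X-Goog-Expires` is a lifetime in seconds, so it has probably stopped working since it was copied. Below is a minimal sketch for checking this with only the standard library. `signed_url_expiry` is a hypothetical helper name (not part of the original file), and the example URL is a trimmed stand-in that keeps only the two query parameters that matter.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse


def signed_url_expiry(url):
    """Return the expiry time of a signed URL, or None if it is not signed.

    Assumes the X-Goog-Date / X-Goog-Expires query parameters seen in the
    CEREAL link: a UTC timestamp plus a lifetime in seconds.
    """
    query = parse_qs(urlparse(url).query)
    if "X-Goog-Date" not in query or "X-Goog-Expires" not in query:
        return None
    signed_at = datetime.strptime(query["X-Goog-Date"][0], "%Y%m%dT%H%M%SZ")
    signed_at = signed_at.replace(tzinfo=timezone.utc)
    return signed_at + timedelta(seconds=int(query["X-Goog-Expires"][0]))


# Trimmed stand-in for the CEREAL URL (only the relevant parameters kept).
example = (
    "https://storage.googleapis.com/kagglesdsdata/datasets/2021/5514/cereal.csv"
    "?X-Goog-Date=20241026T200323Z&X-Goog-Expires=259200"
)
expiry = signed_url_expiry(example)
print(expiry.isoformat())
```

If the expiry is in the past, downloading the file once and pointing the entry at a local copy is probably the more durable choice.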

import pandas as pd
import dataSets
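Indexing the list by position (`[0]`, `[1]`, `[2]`) works, but it breaks quietly if the entries are ever reordered. Here is a small sketch of looking entries up by their `name` field instead. `load_dataset` is a hypothetical helper, not part of the original file, and the demo registry uses a stand-in loader so the example runs without pandas or a network connection.

```python
def load_dataset(registry, name):
    """Find the entry whose "name" matches and call its loader on its URL."""
    for entry in registry:
        if entry["name"] == name:
            return entry["load_function"](entry["URL"])
    known = ", ".join(entry["name"] for entry in registry)
    raise KeyError(f"No data set named {name!r}; known names: {known}")


# Stand-in registry with the same keys as myDictionary. The loader here just
# echoes the URL instead of reading a file.
demo_registry = [
    {
        "URL": "demo://ufo",
        "name": "UFO",
        "load_function": lambda url: f"loaded {url}",
    },
]
result = load_dataset(demo_registry, "UFO")
print(result)
```

With the real helper file this would be called as `load_dataset(dataSets.myDictionary, "COFFEE")`.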

UFO sightings and shape descriptions#

This data set was collected to see which attributes and appearances have been associated with UFO sightings.

More info can be found here

Questions:

  • How was this data collected?

  • How was this used in any actual research?

UFO = dataSets.myDictionary[0]["load_function"](dataSets.myDictionary[0]["URL"])
UFO
      Event.Date  Shape               Location           State  Country      Source  USA  weekday
0     6/18/2016   Boomerang/V-Shaped  South Barrington   IL     USA          NUFORC  1    7
1     6/17/2016   Boomerang/V-Shaped  Kuna               ID     USA          NUFORC  1    6
2     5/30/2016   Boomerang/V-Shaped  Lake Stevens       WA     USA          NUFORC  1    2
3     5/27/2016   Boomerang/V-Shaped  Gerber             CA     USA          NUFORC  1    6
4     5/24/2016   Boomerang/V-Shaped  Camdenton          MO     USA          NUFORC  1    3
...   ...         ...                 ...                ...    ...          ...     ...  ...
3641  11/2/2015   Unknown             Phnom Penh         NaN    Cambodia     NUFORC  0    2
3642  4/15/2015   Unknown             Hemel Hempstead    NaN    England/UK   NUFORC  0    4
3643  1/2/2005    Unknown             Manat              NaN    Puerto Rico  NUFORC  0    1
3644  5/4/1988    Unknown             Bounty (the ship)  NaN    NaN          NUFORC  0    4
3645  11/15/1978  Unknown             NaN                NaN    Tonga        NUFORC  0    4

3646 rows × 8 columns
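A natural first step for this data set is to count sightings per reported `Shape` and to see how many come from the USA. The sketch below uses only the ten rows visible in the preview, so its numbers say nothing about the full 3646 rows. On the real data the same calls would be applied to the `UFO` frame.

```python
import pandas as pd

# The ten rows visible in the preview; a real analysis would use the full UFO frame.
sample = pd.DataFrame(
    {
        "Shape": ["Boomerang/V-Shaped"] * 5 + ["Unknown"] * 5,
        "Country": ["USA"] * 5
        + ["Cambodia", "England/UK", "Puerto Rico", None, "Tonga"],
        "USA": [1] * 5 + [0] * 5,
    }
)

# Number of sightings per reported shape.
shape_counts = sample["Shape"].value_counts()

# Share of sightings located in the USA, using the 0/1 indicator column.
usa_share = sample["USA"].mean()

print(shape_counts)
print(f"USA share: {usa_share:.0%}")
```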

80 Cereals#

This data set contains information on sugar, calories, health rating, and brand for each cereal. It was meant to show how unhealthy cereals can be.

More can be found here

Questions:

  • The original data was collected by Petra Isenberg, Pierre Dragicevic and Yvonne Jansen. Was there any bias in how it was collected?

  • What do the healthiest cereals have in common?

  • Are the healthy cereals really that far from the less healthy ones?

cereal = dataSets.myDictionary[1]["load_function"](dataSets.myDictionary[1]["URL"])
cereal
     name                       mfr  type  calories  protein  fat  sodium  fiber  carbo  sugars  potass  vitamins  shelf  weight  cups  rating
0    100% Bran                  N    C     70        4        1    130     10.0   5.0    6       280     25        3      1.0     0.33  68.402973
1    100% Natural Bran          Q    C     120       3        5    15      2.0    8.0    8       135     0         3      1.0     1.00  33.983679
2    All-Bran                   K    C     70        4        1    260     9.0    7.0    5       320     25        3      1.0     0.33  59.425505
3    All-Bran with Extra Fiber  K    C     50        4        0    140     14.0   8.0    0       330     25        3      1.0     0.50  93.704912
4    Almond Delight             R    C     110       2        2    200     1.0    14.0   8       -1      25        3      1.0     0.75  34.384843
...  ...                        ...  ...   ...       ...      ...  ...     ...    ...    ...     ...     ...       ...    ...     ...   ...
72   Triples                    G    C     110       2        1    250     0.0    21.0   3       60      25        3      1.0     0.75  39.106174
73   Trix                       G    C     110       1        1    140     0.0    13.0   12      25      25        2      1.0     1.00  27.753301
74   Wheat Chex                 R    C     100       3        1    230     3.0    17.0   3       115     25        1      1.0     0.67  49.787445
75   Wheaties                   G    C     100       3        1    200     3.0    17.0   3       110     25        1      1.0     1.00  51.592193
76   Wheaties Honey Gold        G    C     110       2        1    200     1.0    16.0   8       60      25        1      1.0     0.75  36.187559

77 rows × 16 columns
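One way to approach the questions about what healthy cereals share is to compare the highest and lowest rated cereals on a few nutrients. The sketch below does this with the ten rows visible in the preview and a handful of columns. The choice of three cereals per group and of these particular nutrients is an assumption made for illustration, and on the real data the same calls would run on the `cereal` frame.

```python
import pandas as pd

# The ten rows visible in the preview, limited to a few columns of interest.
sample = pd.DataFrame(
    {
        "name": [
            "100% Bran", "100% Natural Bran", "All-Bran",
            "All-Bran with Extra Fiber", "Almond Delight", "Triples",
            "Trix", "Wheat Chex", "Wheaties", "Wheaties Honey Gold",
        ],
        "calories": [70, 120, 70, 50, 110, 110, 110, 100, 100, 110],
        "fiber": [10.0, 2.0, 9.0, 14.0, 1.0, 0.0, 0.0, 3.0, 3.0, 1.0],
        "sugars": [6, 8, 5, 0, 8, 3, 12, 3, 3, 8],
        "rating": [
            68.402973, 33.983679, 59.425505, 93.704912, 34.384843,
            39.106174, 27.753301, 49.787445, 51.592193, 36.187559,
        ],
    }
)

# Three highest and three lowest rated cereals in the sample.
top3 = sample.nlargest(3, "rating")
bottom3 = sample.nsmallest(3, "rating")

# Average nutrient values for each group, side by side.
nutrients = ["calories", "fiber", "sugars"]
comparison = pd.DataFrame(
    {"top 3": top3[nutrients].mean(), "bottom 3": bottom3[nutrients].mean()}
)
print(comparison)
```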

Coffee as a Stimulant#

A simple data set for checking whether coffee consumption is related to how fast a participant can type.

More can be found here

Questions:

  • Could the body size of the participant change these results?

  • Why was only coffee used, and not energy drinks or other “energy” substitutes?

coffee = dataSets.myDictionary[2]["load_function"](dataSets.myDictionary[2]["URL"])
coffee
     Cups of coffee consumed  Caffeinated or Decaffeinated  Coffee Brand  Time of the day  Typing Speed in characters per minute
0    2.0                      Caffeinated                   Folgers       Morning          260
1    1.0                      Caffeinated                   Folgers       Morning          205
2    1.0                      Decaffeinated                 Folgers       Morning          183
3    2.0                      Caffeinated                   Nescafe       Morning          247
4    1.0                      Caffeinated                   Nescafe       Morning          211
...  ...                      ...                           ...           ...              ...
78   1.0                      Decaffeinated                 Himalayan     Evening          198
79   1.5                      Decaffeinated                 Folgers       Morning          185
80   1.5                      Decaffeinated                 Himalayan     Morning          191
81   1.5                      Decaffeinated                 Nescafe       Afternoon        187
82   1.5                      Decaffeinated                 Folgers       Evening          186

83 rows × 5 columns

coffee.describe()
       Cups of coffee consumed  Typing Speed in characters per minute
count  83.000000                83.000000
mean   2.246988                 220.421687
std    1.213173                 37.973393
min    1.000000                 176.000000
25%    1.000000                 190.000000
50%    2.000000                 205.000000
75%    3.000000                 257.500000
max    5.500000                 291.000000
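`describe()` summarizes each numeric column separately, so it does not show whether coffee and typing speed move together. A rough sketch of two follow-up checks is a group mean by coffee type and a correlation between cups and speed. It uses only the ten rows visible in the preview, with column names shortened for readability (the short names are an assumption, not the real headers), so the numbers are illustrative rather than a result for the full 83 rows.

```python
import pandas as pd

# The ten preview rows; "cups", "kind" and "typing_speed" are shortened stand-ins
# for the longer column names in the real frame.
sample = pd.DataFrame(
    {
        "cups": [2.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.5, 1.5, 1.5, 1.5],
        "kind": [
            "Caffeinated", "Caffeinated", "Decaffeinated", "Caffeinated",
            "Caffeinated", "Decaffeinated", "Decaffeinated", "Decaffeinated",
            "Decaffeinated", "Decaffeinated",
        ],
        "typing_speed": [260, 205, 183, 247, 211, 198, 185, 191, 187, 186],
    }
)

# Average typing speed for caffeinated vs. decaffeinated coffee.
mean_speed = sample.groupby("kind")["typing_speed"].mean()

# Pearson correlation between cups consumed and typing speed.
corr = sample["cups"].corr(sample["typing_speed"])

print(mean_speed)
print(f"correlation between cups and speed: {corr:.2f}")
```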