Exploring Data Sets
This project was to find data sets online and then create a Python helper file for loading them.
The helper file is a list of dictionaries, each holding a link to the data file, a short name, and the pandas read function that loads it.
This is what the file looked like:
import pandas as pd
myDictionary = [
{
"URL" : 'http://users.stat.ufl.edu/~winner/data/ufo_location_shape.csv',
"name" : "UFO",
"load_function" : pd.read_csv
},
{
"URL" : 'https://storage.googleapis.com/kagglesdsdata/datasets/2021/5514/cereal.csv?X-Goog-Algorithm=GOOG4-RSA-SHA256&X-Goog-Credential=gcp-kaggle-com%40kaggle-161607.iam.gserviceaccount.com%2F20241026%2Fauto%2Fstorage%2Fgoog4_request&X-Goog-Date=20241026T200323Z&X-Goog-Expires=259200&X-Goog-SignedHeaders=host&X-Goog-Signature=9c2da66351987f3fac5078571e5e1af112e507f9f47a20b165f07dd08e7466ee7b31dde7cdd26383c940a6912e59b614841a7d02904ef4b688f35d1638c57fee86921e10a94ceca264e1fb3db92fa67fde73ae9587f7df4b0a32aa4c543c0b5ba377f40191f870d0e64dda52e3e826a146c2d8da5953668d7fb0bccc9f73a78ee6570c565848af002b2b4dbea15580db88200468fa36b0ae1551c624fc910b783efc8af0e5a4b1a8163eb2e93598d6116dc757f23af153ae11576c72e7626e80c383ad55567635c9b23646c414015631ffeca954490dd8513e2bd942913dac5b3ab8552d68b9836cb7e26ef91f8a77d6e8a858f2b97a87850f790528f503213c',
"name" : "CEREAL",
"load_function" : pd.read_csv
},
{
"URL" : 'https://eazyml.com/documents/Coffee%20As%20A%20Stimulant%20-%20Training%20Data.xlsx',
"name" : "COFFEE",
"load_function" : pd.read_excel
}
]
import pandas as pd
import dataSets
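Each data set below is loaded by position (`myDictionary[0]`, `myDictionary[1]`, and so on), which ties the notebook to the order of entries in the helper file. A minimal sketch of a lookup by the `"name"` field follows. `load_dataset` and `demo_registry` are hypothetical names that are not part of the original helper file, and the stand-in loader exists only so the sketch runs without network access.

```python
def load_dataset(registry, name):
    """Find the entry whose "name" matches and call its loader on its URL.

    Hypothetical helper: `registry` is assumed to be a list of dicts shaped
    like dataSets.myDictionary ("URL", "name", "load_function").
    """
    for entry in registry:
        if entry["name"] == name:
            return entry["load_function"](entry["URL"])
    known = [entry["name"] for entry in registry]
    raise KeyError(f"no data set named {name!r}; known names: {known}")


# Stand-in registry (an assumption for illustration) so this runs offline;
# a real entry would use pd.read_csv or pd.read_excel as its loader.
demo_registry = [
    {"URL": "ufo.csv", "name": "UFO", "load_function": lambda url: f"loaded {url}"},
]

demo_result = load_dataset(demo_registry, "UFO")
print(demo_result)
```

With the real helper file, the call would read `UFO = load_dataset(dataSets.myDictionary, "UFO")`.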
UFO sightings and shape descriptions
This data set was collected to see which attributes and appearances have been associated with UFO sightings.
Questions:
How was this data collected?
How was this used in any actual research?
UFO = dataSets.myDictionary[0]["load_function"](dataSets.myDictionary[0]["URL"])
UFO
| | Event.Date | Shape | Location | State | Country | Source | USA | weekday |
|---|---|---|---|---|---|---|---|---|
| 0 | 6/18/2016 | Boomerang/V-Shaped | South Barrington | IL | USA | NUFORC | 1 | 7 |
| 1 | 6/17/2016 | Boomerang/V-Shaped | Kuna | ID | USA | NUFORC | 1 | 6 |
| 2 | 5/30/2016 | Boomerang/V-Shaped | Lake Stevens | WA | USA | NUFORC | 1 | 2 |
| 3 | 5/27/2016 | Boomerang/V-Shaped | Gerber | CA | USA | NUFORC | 1 | 6 |
| 4 | 5/24/2016 | Boomerang/V-Shaped | Camdenton | MO | USA | NUFORC | 1 | 3 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3641 | 11/2/2015 | Unknown | Phnom Penh | NaN | Cambodia | NUFORC | 0 | 2 |
| 3642 | 4/15/2015 | Unknown | Hemel Hempstead | NaN | England/UK | NUFORC | 0 | 4 |
| 3643 | 1/2/2005 | Unknown | Manat | NaN | Puerto Rico | NUFORC | 0 | 1 |
| 3644 | 5/4/1988 | Unknown | Bounty (the ship) | NaN | NaN | NUFORC | 0 | 4 |
| 3645 | 11/15/1978 | Unknown | NaN | NaN | Tonga | NUFORC | 0 | 4 |
3646 rows × 8 columns
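One way to start looking at which shapes are reported most often is to count the values in the `Shape` column. The sketch below rebuilds a few rows from the table above as a sample so that it runs offline; with the full data the same calls would be applied to `UFO`. The variable names are illustrative, and treating the `USA` column as a 0/1 flag is an assumption based on the rows shown.

```python
import pandas as pd

# A handful of rows copied from the table above (not the full data set).
sample = pd.DataFrame(
    {
        "Event.Date": ["6/18/2016", "6/17/2016", "5/30/2016", "11/2/2015", "4/15/2015"],
        "Shape": ["Boomerang/V-Shaped"] * 3 + ["Unknown"] * 2,
        "Country": ["USA", "USA", "USA", "Cambodia", "England/UK"],
        "USA": [1, 1, 1, 0, 0],
    }
)

# Number of sightings reported for each shape.
shape_counts = sample["Shape"].value_counts()

# Assumption: USA is a 0/1 flag, so its sum is the number of US sightings.
usa_sightings = int(sample["USA"].sum())

# Parse the month/day/year strings so sightings can be counted per year.
sample["Event.Date"] = pd.to_datetime(sample["Event.Date"], format="%m/%d/%Y")
sightings_per_year = sample["Event.Date"].dt.year.value_counts().sort_index()

print(shape_counts)
print(sightings_per_year)
```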
80 Cereals
This data set contains information on sugar, calories, health rating, and brand for each cereal. It was meant to show how unhealthy cereals can be.
Questions:
The original data was collected by Petra Isenberg, Pierre Dragicevic and Yvonne Jansen. Was there any bias?
What do the most healthy cereals have in common?
Are the healthy cereals that far from the less healthy ones?
cereal = dataSets.myDictionary[1]["load_function"](dataSets.myDictionary[1]["URL"])
cereal
| | name | mfr | type | calories | protein | fat | sodium | fiber | carbo | sugars | potass | vitamins | shelf | weight | cups | rating |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100% Bran | N | C | 70 | 4 | 1 | 130 | 10.0 | 5.0 | 6 | 280 | 25 | 3 | 1.0 | 0.33 | 68.402973 |
| 1 | 100% Natural Bran | Q | C | 120 | 3 | 5 | 15 | 2.0 | 8.0 | 8 | 135 | 0 | 3 | 1.0 | 1.00 | 33.983679 |
| 2 | All-Bran | K | C | 70 | 4 | 1 | 260 | 9.0 | 7.0 | 5 | 320 | 25 | 3 | 1.0 | 0.33 | 59.425505 |
| 3 | All-Bran with Extra Fiber | K | C | 50 | 4 | 0 | 140 | 14.0 | 8.0 | 0 | 330 | 25 | 3 | 1.0 | 0.50 | 93.704912 |
| 4 | Almond Delight | R | C | 110 | 2 | 2 | 200 | 1.0 | 14.0 | 8 | -1 | 25 | 3 | 1.0 | 0.75 | 34.384843 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 72 | Triples | G | C | 110 | 2 | 1 | 250 | 0.0 | 21.0 | 3 | 60 | 25 | 3 | 1.0 | 0.75 | 39.106174 |
| 73 | Trix | G | C | 110 | 1 | 1 | 140 | 0.0 | 13.0 | 12 | 25 | 25 | 2 | 1.0 | 1.00 | 27.753301 |
| 74 | Wheat Chex | R | C | 100 | 3 | 1 | 230 | 3.0 | 17.0 | 3 | 115 | 25 | 1 | 1.0 | 0.67 | 49.787445 |
| 75 | Wheaties | G | C | 100 | 3 | 1 | 200 | 3.0 | 17.0 | 3 | 110 | 25 | 1 | 1.0 | 1.00 | 51.592193 |
| 76 | Wheaties Honey Gold | G | C | 110 | 2 | 1 | 200 | 1.0 | 16.0 | 8 | 60 | 25 | 1 | 1.0 | 0.75 | 36.187559 |
77 rows × 16 columns
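A possible first pass at the question of what the healthiest cereals have in common is to sort by `rating` and check how each nutrient column moves with it. The sketch below uses only the first five rows of the table above and a subset of columns, so its numbers say nothing about the full data set; with the real data the same calls would be applied to `cereal`. The variable names are illustrative.

```python
import pandas as pd

# First five rows from the table above, limited to a few columns.
sample = pd.DataFrame(
    {
        "name": [
            "100% Bran",
            "100% Natural Bran",
            "All-Bran",
            "All-Bran with Extra Fiber",
            "Almond Delight",
        ],
        "calories": [70, 120, 70, 50, 110],
        "fiber": [10.0, 2.0, 9.0, 14.0, 1.0],
        "sugars": [6, 8, 5, 0, 8],
        "rating": [68.402973, 33.983679, 59.425505, 93.704912, 34.384843],
    }
)

# Highest-rated cereals first.
ranked = sample.sort_values("rating", ascending=False)

# Pearson correlation of each numeric column with the rating.
rating_corr = sample[["calories", "fiber", "sugars", "rating"]].corr()["rating"]

print(ranked[["name", "rating"]])
print(rating_corr)
```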
Coffee as a Stimulant
A simple data set to see whether coffee consumption shows any relation to how fast a participant can type.
Questions:
Could the size of the person change these results?
Why was only coffee used and not also energy drinks or other “energy” substitutes?
coffee = dataSets.myDictionary[2]["load_function"](dataSets.myDictionary[2]["URL"])
coffee
| | Cups of coffee consumed | Caffeinated or Decaffeinated | Coffee Brand | Time of the day | Typing Speed in characters per minute |
|---|---|---|---|---|---|
| 0 | 2.0 | Caffeinated | Folgers | Morning | 260 |
| 1 | 1.0 | Caffeinated | Folgers | Morning | 205 |
| 2 | 1.0 | Decaffeinated | Folgers | Morning | 183 |
| 3 | 2.0 | Caffeinated | Nescafe | Morning | 247 |
| 4 | 1.0 | Caffeinated | Nescafe | Morning | 211 |
| ... | ... | ... | ... | ... | ... |
| 78 | 1.0 | Decaffeinated | Himalayan | Evening | 198 |
| 79 | 1.5 | Decaffeinated | Folgers | Morning | 185 |
| 80 | 1.5 | Decaffeinated | Himalayan | Morning | 191 |
| 81 | 1.5 | Decaffeinated | Nescafe | Afternoon | 187 |
| 82 | 1.5 | Decaffeinated | Folgers | Evening | 186 |
83 rows × 5 columns
coffee.describe()
| | Cups of coffee consumed | Typing Speed in characters per minute |
|---|---|---|
| count | 83.000000 | 83.000000 |
| mean | 2.246988 | 220.421687 |
| std | 1.213173 | 37.973393 |
| min | 1.000000 | 176.000000 |
| 25% | 1.000000 | 190.000000 |
| 50% | 2.000000 | 205.000000 |
| 75% | 3.000000 | 257.500000 |
| max | 5.500000 | 291.000000 |
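`describe()` summarizes each numeric column on its own, so it does not show whether typing speed changes with coffee. A rough sketch of two follow-up checks is given here: the mean typing speed for caffeinated versus decaffeinated coffee, and the correlation between cups consumed and typing speed. It uses only the ten rows visible in the coffee table above, so the results are illustrative; on the full data the same calls would be applied to `coffee`. The variable names are assumptions made for this example.

```python
import pandas as pd

col_cups = "Cups of coffee consumed"
col_kind = "Caffeinated or Decaffeinated"
col_speed = "Typing Speed in characters per minute"

# The ten rows shown in the coffee table above (not the full 83-row data set).
sample = pd.DataFrame(
    {
        col_cups: [2.0, 1.0, 1.0, 2.0, 1.0, 1.0, 1.5, 1.5, 1.5, 1.5],
        col_kind: ["Caffeinated", "Caffeinated", "Decaffeinated", "Caffeinated",
                   "Caffeinated"] + ["Decaffeinated"] * 5,
        col_speed: [260, 205, 183, 247, 211, 198, 185, 191, 187, 186],
    }
)

# Mean typing speed for each kind of coffee.
mean_speed = sample.groupby(col_kind)[col_speed].mean()

# Pearson correlation between cups consumed and typing speed.
cups_speed_corr = sample[col_cups].corr(sample[col_speed])

print(mean_speed)
print(cups_speed_corr)
```

A correlation on its own would not separate the effect of caffeine from the number of cups, which is one reason to look at the grouped means as well.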