Foundations: Data
Foundations: Data, and more data #
Data analysis #
The collection, transformation, and organization of data in order to draw conclusions, make predictions, and drive informed decision-making.
Data analyst #
Someone who collects, transforms, and organizes data in order to help make informed decisions.
Data Analytics #
The science of data
Processes for data analysis: #
- Ask
- Prepare
- Process
- Analyze
- Share
- Act
Different questions like: #
- “How can we get customers to recycle our product packaging?” to:
- “What design features will make our packaging easier to recycle?”
The courses you need to complete:
- Foundations: Data
- Ask questions to make data-driven decisions
- Prepare data for exploration
- Process data from dirty to clean
- Analyze data to answer questions
- Share data through the art of visualization
- Data analysis with R programming
- Data analytics capstone project: Complete case study
Foundations: Data: #
- Introducing data analytics and analytical thinking
- The wonderful world of data (data lifecycle, data analysis process)
- Set up your data analytics toolbox
- Become a fair and impactful data professional
Computer + Your brain + Your skills + Your traits = Success
Turning data into insights
Human resources analytics or workforce analytics #
People analytics is the practice of collecting and analyzing data on the people who make up a company’s workforce in order to gain insights to improve how the company operates
The six phases of data analysis: #
Ask #
You’ll work to understand the challenge to be solved or the question to be answered
Prepare #
You’ll find and collect the data you’ll need to answer your questions
Process #
Is when you will clean and organize your data
Analyze #
Is when you do the necessary data analysis to uncover answers and solutions
Share #
When you present your findings to decision-makers through a report, presentation, or data visualizations
Act #
In which you and others in the company put the data insights into action
The data analysis process is designed to build on itself, so the results from each step are the inputs for the next step. Keep in mind, however, that you might not always move through the steps linearly. For example, you might be in the analyze phase and find out your data was pulled from the wrong database. Or, you could learn while cleaning the data that your original question didn’t adequately define the problem.
The six phases of the data analysis process help answer business challenges, such as understanding how to improve a retirement program. Additionally, iterating on and reviewing your work throughout the data analysis process is critical for obtaining quality results.
Data science, the discipline of making data useful, is an umbrella term that encompasses three disciplines: machine learning, statistics, and analytics. If you want to make a few important decisions under uncertainty, that is statistics. If you want to automate, in other words, make many, many, many decisions under uncertainty, that is machine learning and AI. But what if you don’t know how many decisions you want to make before you begin? What if what you’re looking for is inspiration? You want to encounter your unknown unknowns. You want to understand your world. That is analytics.
SAS’s iterative process #
- Ask
- Prepare
- Explore
- Model
- Implement
- Act
- Evaluate
Data ecosystems #
The various elements that interact with one another to produce, manage, store, organize, analyze, and share data
Data science #
Creating new ways of modeling and understanding the unknown by using raw data. “Data scientists create new questions using data, while analysts find answers to existing questions by creating insights from data sources”
Data-driven decision-making #
Using facts to guide business strategy
Step by step #
Ask questions and define the problem.
Prepare data by collecting and storing the information.
Process data by cleaning and checking the information.
Analyze data to find patterns, relationships, and trends.
Share data with your audience.
Act on the data and use the analysis results.
Gut instinct #
Analysts often ask, “How do I define success for this project?”
In addition, try asking yourself these questions about a project to help find the perfect balance:
What kind of results are needed?
Who will be informed?
Am I answering the question being asked?
How quickly does a decision need to be made?
Key takeaways #
Data analysts and detectives share a similar approach to problem-solving, both relying on evidence and facts to make decisions. Data-driven decision-making is essential for analysts, but gut instinct can also play a role in identifying patterns and connections. Balancing data and gut instinct is crucial for making informed decisions, and the right mix depends on the project’s goals and time constraints.
“Which analytical skill involves the ability to break things down into smaller steps and work with them in an orderly and logical way?” A technical mindset involves breaking things down into smaller steps and working with them in an orderly and logical way. Problem-solving is achieved with analytical skills.
Analytical skills #
Qualities and characteristics associated with solving problems using facts
The ones that we are focusing are:
- Curiosity - Wanting to know something
- Understanding context
- Having a technical mindset
- Data design
- Data strategy
Context #
The condition in which something exists or happens.
A technical mindset #
The ability to break things down into smaller steps or pieces and work with them in an orderly and logical way.
Data design #
How you organize information
Data strategy #
The management of the people, processes and tools used in data analysis.
exploratory data analysis (EDA)
Technical mindset #
Focusing on implementing a process, regardless of what that looks like, is a great first step to exercising your technical mindset.
Data Strategy #
Think about a data strategy as a kind of resource allocation—the tools, time, and effort that you put into a project will vary based on what you need to accomplish.
Analytical thinking #
Identifying and defining a problem and then solving it by using data in an organized, step-by-step manner
The five key aspects to analytical thinking #
- Visualization
- Strategy
- Problem-orientation
- Correlation
- Big-picture and detail-oriented thinking
Visualization #
The graphical representation of information
Strategy #
Strategizing helps data analysts see what they want to achieve with the data and how they can get there.
Problem-orientation #
Identify, describe, and solve problems. It’s all about keeping the problem top of mind throughout the entire project.
Correlation #
Being able to identify a correlation between two or more pieces of data. For example, a rainier season may correlate with a higher number of umbrellas being sold. But correlation does not equal causation.
Big-picture and detail-oriented thinking #
If you only focus on individual pieces, you wouldn’t be able to see past that, which is why big-picture thinking is so important. It helps you zoom out and see possibilities and opportunities. This leads to exciting new ideas or innovations. On the flip side, detail-oriented thinking is all about figuring out all of the aspects that will help you execute a plan.
Some important questions: #
- What is the root cause of the problem?
- Where are the gaps in our process?
- What did we not consider before? This is a great way to think about what information or procedure might be missing from a process, so you can identify ways to make better decisions and strategies moving forward
Root cause #
The reason why a problem occurs
Ask “why?” five times to reveal the root cause. The final answer will give you useful and sometimes surprising insights.
Gap analysis #
A method for examining and evaluating how a process works currently in order to get where you want to be in the future.
As a data professional, you can turn to the five whys whenever you feel stumped by a problem and need to approach it from a different perspective.
A quartile divides ordered data points into four equal parts; the three cut points are the first quartile (Q1), the median (Q2), and the third quartile (Q3)
Dataset: A collection of data that can be manipulated or analyzed as one unit
data analysis tools #
- spreadsheets
- databases
- query languages
- visualization software
The life cycle of data #
- Plan
- Capture
- Manage
- Analyze
- Archive
- Destroy
Plan #
Well before starting an analysis project, a business decides what kind of data it needs, how it will be managed throughout its life cycle, who will be responsible for it, and the optimal outcomes.
Capture #
This is where data is collected from a variety of different sources and brought into the organization
Manage #
How and where the data is stored, the tools used to keep it safe and secure, and the actions taken to make sure it’s maintained properly. This phase is very important for data cleansing.
Analyze #
Time to put the data to work. In this phase, the data is used to solve problems, make great decisions, and support business goals.
Archive #
Storing data in a place where it’s still available but may not be used again
Destroy #
To destroy the data, the company would use a secure data erasure software
Database #
A collection of data stored in a computer system
Data life cycle #
Plan: Decide what kind of data is needed, how it will be managed, and who will be responsible for it.
Capture: Collect or bring in data from a variety of different sources.
Manage: Care for and maintain the data. This includes determining how and where it is stored and the tools used to do so.
Analyze: Use the data to solve problems, make decisions, and support business goals.
Archive: Keep relevant data stored for long-term and future reference.
Destroy: Remove data from storage and delete any shared copies of the data.
Govern how data is handled so that it is accurate, secure, and available to meet your organization’s needs.
Data analysis #
The process of analyzing data
The ask phase #
In this phase we define the problem to be solved and we make sure that we fully understand stakeholder expectations. “What is the purpose of this analysis?” “What are we hoping to learn from it?”
Stakeholders #
People who have invested time and resources into a project and are interested in the outcome
Defining a problem #
Look at the current state and identify how it’s different from the ideal state. Determine who the stakeholders are in order to understand their expectations. For example, decide whether the problem includes all types of risks that could affect the company or just risks related to weather.
Prepare #
different types of data and how to identify which kinds of data are most useful for solving a particular problem.
You’ll also discover why it’s so important that your data and results are objective and unbiased. “We need to be thinking about the type of data we need in order to answer the questions that we’ve set out to answer based on what we learned when we asked the right questions”
the process phase #
This usually means cleaning data, transforming it into a more useful format, combining two or more datasets to make information more complete, and removing outliers, which are any data points that could skew the information. This phase is all about getting the details right. “This is where you get a chance to understand its structure, its quirks, its nuances, and you really get a chance to understand deeply what type of data you’re going to be working with and understanding what potential that data has to answer all of your questions.” After cleaning the data and running all the quality assurance checks, the data is ready for the analyze phase.
Analyze #
involves using tools to transform and organize that information so that you can draw useful conclusions, make predictions, and drive informed decision-making.
“This is the point where we have to take a step back and let the data speak for itself”
Share phase #
Data analysts interpret results and share them with others to help stakeholders make effective data-driven decisions.
Act #
“All of this work from asking the right questions to collect your data, to analyzing and sharing doesn’t mean much of anything, if we aren’t taking action on what we’ve just learned”
A function in spreadsheets #
A preset command that automatically performs a specific process or task using the data in a spreadsheet, for example =SUM(A1:A10)
Query language #
A computer programming language that allows you to retrieve and manipulate data from a database
Some popular visualization tools #
- Tableau
- Looker
Depending on which phase of the data analysis process you’re in, you will need to use different tools. For example, if you are focusing on creating complex and eye-catching visualizations, then the visualization tools we discussed earlier are the best choice. But if you are focusing on organizing, cleaning, and analyzing data, then you will probably be choosing between spreadsheets and databases using queries.
Database: A collection of data stored in a computer system
Formula: A set of instructions used to perform a calculation using the data in a spreadsheet
Function: A preset command that automatically performs a specified process or task using the data in a spreadsheet
Query: A request for data or information from a database
Query language: A computer programming language used to communicate with a database
Stakeholders: People who invest time and resources into a project and are interested in its outcome
Structured Query Language: A computer programming language used to communicate with a database
Spreadsheet: A digital worksheet
SQL: (Refer to Structured Query Language)
Organize your data #
One way to organize your data is by sorting it.
Select all columns that contain data. There are a few ways to select multiple cells:
To select nonadjacent cells and/or cell ranges, hold the Command (Mac) or Ctrl (PC) key and select the cells.
To select a range of cells, hold the Shift key and either drag your cursor over which cells you want to include or use the arrow keys to select a range.
Select a single cell and drag your cursor over the cells you want to include in your selection.
Select the Data menu.
Select Sort range, then select Advanced range sorting options.
In the Advanced range sorting options window, select the checkbox for Data has header row. Make sure that A to Z is selected.
Select the Sort by drop-down menu, then select Siblings.
Select Sort. This will organize the spreadsheet by the number of siblings, from lowest to highest.
The column labels are usually called attributes
Attribute #
A characteristic or quality of data used to label a column in a table
In a dataset, a row is also called an observation
Observation #
All of the attributes for something contained in a row of a data table
Formula #
A set of instructions that performs a specific action using the data in a spreadsheet
SQL #
- Store
- Organize
- Analyze
Query #
A request for data or information from a database
Syntax is the predetermined structure of a language that includes all required words, symbols, and punctuation.
Notice that unlike the SELECT command that uses a comma to separate fields / variables / parameters, the WHERE command uses the AND statement to connect conditions.
However, using capitalization and indentation can help you read the information more easily. Keep your queries neat, and they will be easier to review or troubleshoot if you need to check them later on.
Comments are text placed between certain characters, /* and */, or after two dashes (--), as shown below. You can also give a column a new name, called an alias, using AS.
You create a SQL query similar to the one below, where <> means “does not equal”:
SELECT * FROM Employee WHERE jobCode <> 'INT' AND salary <= 30000;
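Putting those conventions together, here is a hedged sketch of the same query with capitalization, indentation, comments, and an AS alias (the alias name is just illustrative):

/* Employees who are not interns and earn 30,000 or less */
SELECT
    jobCode,
    salary AS base_salary  -- AS gives the column a more readable name
FROM Employee
WHERE jobCode <> 'INT'
    AND salary <= 30000;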
“Data visualizations are pictures. They are a wonderful way to take very basic ideas around data and data points and make them come alive”
Attribute: A characteristic or quality of data used to label a column in a table
Function: A preset command that automatically performs a specified process or task using the data in a spreadsheet
Observation: The attributes that describe a piece of data contained in a row of a table
Oversampling: The process of increasing the sample size of nondominant groups in a population. This can help you better represent them and address imbalanced datasets
Self-reporting: A data collection technique where participants provide information about themselves
Issue #
A topic or subject to investigate
Question #
Designed to discover information
Problem #
An obstacle or complication that needs to be worked out
Business task #
The question or problem data analysis answers for a business
Fairness #
Ensuring that your analysis doesn’t create or reinforce bias
Consider all of the available data #
Part of your job as a data analyst is to determine what data is going to be useful for your analysis. Often there will be data that isn’t relevant to what you’re focusing on or doesn’t seem to align with your expectations. But you can’t just ignore it; it’s critical to consider all of the available data so that your analysis reflects the truth and not just your own expectations.
Example: A state’s Department of Transportation is interested in measuring traffic patterns on holidays. At first, they only include metrics related to traffic volumes and the fact that the days are holidays. But the data team realizes they failed to consider how weather on these holidays might also affect traffic volumes. Considering this additional data helps them gain more complete insights.
Identify surrounding factors #
As you’ll learn throughout these courses, context is key for you and your stakeholders to understand the final conclusions of any analysis. Similar to considering all of the data, you also must understand surrounding factors that could influence the insights you’re gaining.
Example: A human resources department wants to better plan for employee vacation time in order to anticipate staffing needs. HR uses a list of national bank holidays as a key part of the data-gathering process. But they fail to consider important holidays that aren’t on the bank calendar, which introduces bias against employees who celebrate them. It also gives HR less useful results because bank holidays may not necessarily apply to their actual employee population.
Include self-reported data #
Self-reporting is a data collection technique where participants provide information about themselves. Self-reported data can be a great way to introduce fairness in your data collection process. People bring conscious and unconscious bias to their observations about the world, including about other people. Using self-reporting methods to collect data can help avoid these observer biases. Additionally, separating self-reported data from other data you collect provides important context to your conclusions!
Example: A data analyst is working on a project for a brick-and-mortar retailer. Their goal is to learn more about their customer base. This data analyst knows they need to consider fairness when they collect data; they decide to create a survey so that customers can self-report information about themselves. By doing that, they avoid bias that might be introduced with other demographic data collection methods. For example, if they had sales associates report their observations about customers, they might introduce any unconscious bias the employees had to the data.
Use oversampling effectively #
When collecting data about a population, it’s important to be aware of the actual makeup of that population. Sometimes, oversampling can help you represent groups in that population that otherwise wouldn’t be represented fairly. Oversampling is the process of increasing the sample size of nondominant groups in a population. This can help you better represent them and address imbalanced datasets.
Example: A fitness company is releasing new digital content for users of their equipment. They are interested in designing content that appeals to different users, knowing that different people may interact with their equipment in different ways. For example, part of their user-base is age 70 or older. In order to represent these users, they oversample them in their data. That way, decisions they make about their fitness content will be more inclusive.
Think about fairness from beginning to end #
To ensure that your analysis and final conclusions are fair, be sure to consider fairness from the earliest stages of a project to when you act on the data insights. This means that data collection, cleaning, processing, and analysis are all performed with fairness in mind.
Example: A data team kicks off a project by including fairness measures in their data-collection process. These measures include oversampling their population and using self-reported data. However, they fail to inform stakeholders about these measures during the presentation. As a result, stakeholders leave with skewed understandings of the data. Learning from this experience, they add key information about fairness considerations to future stakeholder presentations.
Marketing analyst—analyzes market conditions to assess the potential sales of products and services
HR/payroll analyst—analyzes payroll data for inefficiencies and errors
Financial analyst—analyzes financial status by collecting, monitoring, and reviewing data
Risk analyst—analyzes financial documents, economic conditions, and client data to help companies determine the level of risk involved in making a particular business decision
Healthcare analyst—analyzes medical data to improve the business aspect of hospitals and medical facilities
To name a few others that sound similar but may not be the same role:
Business analyst—analyzes data to help businesses improve processes, products, or services
Data analytics consultant—analyzes the systems and models for using data
Data engineer—prepares and integrates data from different sources for analytical use
Data scientist—uses expert skills in technology and social science to find trends through data analysis
Data specialist—organizes or converts data for use in databases or software systems
Operations analyst—analyzes data to assess the performance of business operations and workflows
“Being open to learning is one of the most important qualities for a data analyst”
Ask questions to make data-driven decisions #
Structured thinking #
The process of recognizing the current problem or situation, organizing available information, revealing gaps and opportunities, and identifying the options.
The modules
- Ask effective questions
- Make data-driven decisions
- More spreadsheets basics
- Always remember the stakeholder
Step 1: Ask #
It’s impossible to solve a problem if you don’t know what it is. These are some things to consider:
Define the problem you’re trying to solve
Make sure you fully understand the stakeholder’s expectations
Focus on the actual problem and avoid any distractions
Collaborate with stakeholders and keep an open line of communication
Take a step back and see the whole situation in context
Questions to ask yourself in this step: #
What are my stakeholders saying their problems are?
Now that I’ve identified the issues, how can I help the stakeholders resolve their questions?
Step 2: Prepare #
You will decide what data you need to collect in order to answer your questions and how to organize it so that it is useful. You might use your business task to decide:
What metrics to measure
Where to locate data in your database
What security measures to create to protect that data
Questions to ask yourself in this step: #
What do I need to figure out how to solve this problem?
What research do I need to do?
Step 3: Process #
Clean data is the best data and you will need to clean up your data to get rid of any possible errors, inaccuracies, or inconsistencies. This might mean:
Using spreadsheet functions to find incorrectly entered data
Using SQL functions to check for extra spaces (see the SQL sketch after this list)
Removing repeated entries
Checking as much as possible for bias in the data
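As a hedged sketch of the SQL-flavored items above (the customers table and its column names are hypothetical):

-- Check for extra spaces and remove repeated entries in one pass
SELECT DISTINCT
    customer_id,
    TRIM(customer_name) AS customer_name_clean  -- strips leading and trailing spaces
FROM customers;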
Questions to ask yourself in this step: #
What data errors or inaccuracies might get in my way of getting the best possible answer to the problem I am trying to solve?
How can I clean my data so the information I have is more consistent?
Step 4: Analyze #
You will want to think analytically about your data. At this stage, you might sort and format your data to make it easier to:
Perform calculations
Combine data from multiple sources
Create tables with your results
Questions to ask yourself in this step: #
What story is my data telling me?
How will my data help me solve this problem?
Who needs my company’s product or service? What type of person is most likely to use it?
Step 5: Share #
Everyone shares their results differently, so be sure to summarize your results with clear and enticing visuals of your analysis, using tools like graphs or dashboards. This is your chance to show the stakeholders you have solved their problem and how you got there. Sharing will certainly help your team:
Make better decisions
Make more informed decisions
Lead to stronger outcomes
Successfully communicate your findings
Questions to ask yourself in this step: #
How can I make what I present to the stakeholders engaging and easy to understand?
What would help me understand this if I were the listener?
Step 6: Act #
Now it’s time to act on your data. You will take everything you have learned from your data analysis and put it to use. This could mean providing your stakeholders with recommendations based on your findings so they can make data-driven decisions.
Questions to ask yourself in this step: #
- How can I use the feedback I received during the share phase (step 5) to actually meet the stakeholder’s needs and expectations?
These six steps can help you to break the data analysis process into smaller, manageable parts, which is called structured thinking. This process involves four basic activities:
Recognizing the current problem or situation
Organizing available information
Revealing gaps and opportunities
Identifying your options
Types of problems #
Making predictions #
Using data to make an informed decision about how things may be in the future
Categorizing things #
Assigning information to different groups or clusters based on common features
Spotting something unusual #
Identifying data that is different from the norm
Identifying themes #
Grouping categorized information into broader concepts
Discovering connections #
Finding similar challenges faced by different entities and combining data and insights to address them
Finding patterns #
Using historical data to understand what happened in the past and is therefore likely to happen again
Making predictions #
A company that wants to know the best advertising method to bring in new customers is an example of a problem requiring analysts to make predictions. Analysts with data on location, type of media, and number of new customers acquired as a result of past ads can’t guarantee future results, but they can help predict the best placement of advertising to reach the target audience.
Categorizing things #
An example of a problem requiring analysts to categorize things is a company’s goal to improve customer satisfaction. Analysts might classify customer service calls based on certain keywords or scores. This could help identify top-performing customer service representatives or help correlate certain actions taken with higher customer satisfaction scores.
Spotting something unusual #
A company that sells smart watches that help people monitor their health would be interested in designing their software to spot something unusual. Analysts who have analyzed aggregated health data can help product developers determine the right algorithms to spot and set off alarms when certain data doesn’t trend normally.
Identifying themes #
User experience (UX) designers might rely on analysts to analyze user interaction data. Similar to problems that require analysts to categorize things, usability improvement projects might require analysts to identify themes to help prioritize the right product features for improvement. Themes are most often used to help researchers explore certain aspects of data. In a user study, user beliefs, practices, and needs are examples of themes.
By now you might be wondering if there is a difference between categorizing things and identifying themes. The best way to think about it is: categorizing things involves assigning items to categories; identifying themes takes those categories a step further by grouping them into broader themes.
Discovering connections #
A third-party logistics company working with another company to get shipments delivered to customers on time is a problem requiring analysts to discover connections. By analyzing the wait times at shipping hubs, analysts can determine the appropriate schedule changes to increase the number of on-time deliveries.
Finding patterns #
Minimizing downtime caused by machine failure is an example of a problem requiring analysts to find patterns in data. For example, by analyzing maintenance data, they might discover that most failures happen if regular maintenance is delayed by more than a 15-day window.
Types of questions to avoid #
Leading question #
Because it’s leading you to answer in a certain way.
Close-ended question #
That means it can be answered with a yes or no. These kinds of questions rarely lead to valuable insights.
Questions that are too vague and lack context
Smart questions #
Specific #
Specific questions are simple, significant, and focused on a single topic or a few closely related ideas
For example #
“Are kids getting enough exercise these days?” can be replaced with: “What percentage of kids achieve the recommended 60 minutes of physical activity at least five days a week?”
Measurable #
Measurable questions can be quantified and assessed
For example #
Instead of “Why did our recent video go viral?”, ask “How many times was our video shared on social channels the first week it was posted?”
Action-oriented #
Action-oriented questions encourage change
For example #
Instead of “How can we get customers to recycle our product packaging?”, ask “What design features will make our packaging easier to recycle?”
Relevant #
Relevant questions matter, are important, and have significance to the problem you’re trying to solve
For example #
“Why does it matter that Pine Barrens tree frogs started disappearing?” to, “What environmental factors changed in Durham, North Carolina, between 1983 and 2004 that could cause Pine Barrens tree frogs to disappear from the Sandhills Regions?”
Time-bound #
Time-bound questions specify the time to be studied. This limits the range of possibilities and enables the data analyst to focus on relevant data.
Fairness #
Ensuring that your questions don’t create or reinforce bias
Specific: Is the question specific? Does it address the problem? Does it have context? Will it uncover a lot of the information you need?
Measurable: Will the question give you answers that you can measure?
Action-oriented: Will the answers provide information that helps you devise some type of plan?
Relevant: Is the question about the particular problem you are trying to solve?
Time-bound: Are the answers relevant to the specific time being studied?
Things to avoid when asking questions #
Leading questions: questions that only have a particular response
- Example: This product is too expensive, isn’t it?
This is a leading question because it suggests an answer as part of the question. A better question might be, “What is your opinion of this product?” There are tons of answers to that question, and they could include information about usability, features, accessories, color, reliability, and popularity, on top of price. Now, if your problem is actually focused on pricing, you could ask a question like “What price (or price range) would make you consider purchasing this product?” This question would provide a lot of different measurable responses.
Closed-ended questions: questions that ask for a one-word or brief response only
- Example: Were you satisfied with the customer trial?
This is a closed-ended question because it doesn’t encourage people to expand on their answer. It is really easy for them to give one-word responses that aren’t very informative. A better question might be, “What did you learn about customer experience from the trial?” This encourages people to provide more detail besides “It went well.”
Vague questions: questions that aren’t specific or don’t provide context
- Example: Does the tool work for you?
This question is too vague because there is no context. Is it about comparing the new tool to the one it replaces? You just don’t know. A better inquiry might be, “When it comes to data entry, is the new tool faster, slower, or about the same as the old tool? If faster, how much time is saved? If slower, how much time is lost?” These questions give context (data entry) and help frame responses that are measurable (time).
Some common topics for questions include:
Objectives
Audience
Time
Resources
Security
For instance, if you have a conversation with someone who works in retail, you might lead with questions like:
Specific: Do you currently use data to drive decisions in your business? If so, what kind(s) of data do you collect, and how do you use it?
Measurable: Do you know what percentage of sales is from your top-selling products?
Action-oriented: Are there business decisions or changes that you would make if you had the right information? For example, if you had information about how umbrella sales change with the weather, how would you use it?
Relevant: How often do you review data from your business?
Time-bound: Can you describe how data helped you make good decisions for your store(s) this past year?
If you are having a conversation with a teacher, you might ask different questions, such as:
Specific: What kind of data do you use to build your lessons?
Measurable: How well do student benchmark test scores correlate with their grades?
Action-oriented: Do you share your data with other teachers to improve lessons?
Relevant: Have you shared grading data with an entire class? If so, do students seem to be more or less motivated, or about the same?
Time-bound: In the last five years, how many times did you review data from previous academic years?
If you are having a conversation with a small business owner of an ice cream shop, you could ask:
Specific: What data do you use to help with purchasing and inventory?
Measurable: Can you order (rank) these factors from most to least influential on sales: price, flavor, and time of year (season)?
Action-oriented: Is there a single factor you need more data on so you can potentially increase sales?
Relevant: How do you advertise to or communicate with customers?
Time-bound: What does your year-over-year sales growth look like for the last three years?
Good Notes #
Helpful aspects of your conversation to note include:
Facts: Write down any concrete piece of information, such as dates, times, names, and other specifics.
Context: Facts without context are useless. Note any relevant details that are needed in order to understand the information you gather.
Unknowns: Sometimes you may miss an important question during a conversation. Make a note when this happens so you can figure out the answer later.
For example, if the previous SMART questions led the ice cream shop owner to propose a project to analyze customer flavor preferences, your notes might appear something like this:
Project: Collect customer flavor preference data.
Overall business goal: Use data to offer or create more popular flavors.
Two data sources: Cash register receipts and completed customer surveys (email).
Target completion date: Q2
To do: Call back later and speak with the manager about the location of survey data.
Structured thinking: The process of recognizing the current problem or situation, organizing available information, revealing gaps and opportunities, and identifying options
Data-driven decisions #
data-driven decision-making means using facts to guide business strategy. The phrase “data-driven decisions” means exactly that: Data is used to arrive at a decision. This approach is limited by the quantity and quality of readily-available data. If the quality and quantity of the data is sufficient, this approach can far improve decision-making. But if the data is insufficient or biased, this can create problems for decision-makers. Potential dangers of relying entirely on data-driven decision-making can include overreliance on historical data, a tendency to ignore qualitative insights, and potential biases in data collection and analysis.
Data-inspired decisions #
Data-inspired decisions include the same considerations as data-driven decisions while adding another layer of complexity. They create space for people using data to consider a broader range of ideas: drawing on comparisons to related concepts, giving weight to feelings and experiences, and considering other qualities that may be more difficult to measure. Data-inspired decision-making can avoid some of the pitfalls that data-driven decisions might be prone to.
There are two kinds of data, quantitative and qualitative
Quantitative data is all about the specific and objective measures of numerical facts.
This can often be the what, how many, and how often about a problem.
In other words, things you can measure, like how many commuters take the train to work every week.
On the other hand, qualitative data describes subjective or explanatory measures of qualities and characteristics or things that can’t be measured with numerical data, like your hair color.
Qualitative data is great for helping us answer why questions.
With quantitative data we can see numbers visualized as charts or graphs. Qualitative data can give us a more high-level understanding of why the numbers are the way they are. This is important because it helps us add context to a problem
Report #
Static collection of data given to stakeholders periodically
Pros:
- High-level historical data
- Easy to design
- Pre-cleaned and sorted data
Cons:
- Continual maintenance
- Less visually appealing
- Static
Dashboard #
Monitors live, incoming data
Pros:
- Dynamic, automatic, and interactive
- More stakeholder access
- Low maintenance
Cons:
- Labor-intensive design
- Can be confusing
- Potentially uncleaned data
Pivot table #
A data summarization tool that is used in data processing. Pivot tables are used to summarize, sort, reorganize, group, count, total or average data stored in a database.
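The kind of summary a pivot table produces can also be sketched with a SQL GROUP BY; the sales table and column names here are hypothetical:

-- Summarize sales by region, like a pivot table with region as
-- rows and count, total, and average as values
SELECT
    region,
    COUNT(*)    AS num_orders,
    SUM(amount) AS total_sales,
    AVG(amount) AS avg_sale
FROM sales
GROUP BY region;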
Metric #
Single, quantifiable type of data that can be used for measurement.
Metric goal #
A measurable goal set by a company and evaluated using metrics
Types of dashboards #
Strategic: focuses on long-term goals and strategies at the highest level of metrics. These dashboards provide information over the longest time frame, from a single financial quarter to years.
Operational: focuses on short-term performance tracking and intermediate goals. These dashboards contain information on a time scale of days, weeks, or months, so they can provide performance insight almost in real time.
Analytical: consists of the datasets and the mathematics used in these sets
These dashboards contain the details involved in the usage, analysis, and predictions made by data scientists.
Certainly the most technical category, analytic dashboards are usually created and maintained by data science teams and rarely shared with upper management as they can be very difficult to understand
Small data
- Describes a dataset made up of specific metrics over a short, well-defined time period
- Usually organized and analyzed in spreadsheets
- Likely to be used by small and midsize businesses
- Simple to collect, store, manage, sort, and visually represent
- Usually already a manageable size for analysis
Big data
- Describes large, less-specific datasets that cover a long time period
- Usually kept in a database and queried
- Likely to be used by large organizations
- Takes a lot of effort to collect, store, manage, sort, and visually represent
- Usually needs to be broken into smaller pieces in order to be organized and analyzed effectively for decision-making
- Volume: The amount of data
- Variety: The different kinds of data
- Velocity: How fast the data can be processed
- Veracity: The quality and reliability of the data
Return on investment (ROI): A formula that uses the metrics of investment and profit to evaluate the success of an investment
Revenue: The total amount of income generated by the sale of goods or services
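As a quick worked illustration of ROI (numbers invented): ROI = net profit ÷ cost of investment. If you invest $1,000 and earn back $1,200, the net profit is $200, so ROI = 200 ÷ 1,000 = 20%.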
Spreadsheet tasks #
- Organize your data: pivot tables, sorting, and filtering
- Calculate your data: formulas and functions
Spreadsheets and the data life cycle #
Plan for the users who will work within a spreadsheet by developing organizational standards.
Capture data from the source by connecting spreadsheets to other data sources, such as an online survey application or a database.
Manage different kinds of data with a spreadsheet. This can involve storing, organizing, filtering, and updating information. Spreadsheets also let you decide who can access the data, how the information is shared, and how to keep your data safe and secure.
Analyze data in a spreadsheet to help make better decisions. Some of the most common spreadsheet analysis tools include formulas to aggregate data or create reports, and pivot tables for clear, easy-to-understand visuals.
Archive, with built-in tools, any spreadsheet that you don’t use often but might need to reference later.
Destroy your spreadsheet when you are certain that you will never need it again, if you have better backup copies, or for legal or security reasons.
Spreadsheets Errors #
- #N/A: Data in a formula can’t be found by the spreadsheet
- #NAME?: A formula or function name isn’t understood
- #NUM!: A formula or function calculation can’t be performed as specified by the data
- #VALUE!: A general error that could indicate a problem with a formula or referenced cells
- #REF!: A formula is referencing a cell that is no longer valid or has been deleted
Calculating the change from one month to another #
(next month - first month) / first month
For example, if the first month is 1,000 and the next month is 1,100, the change is (1,100 - 1,000) / 1,000 = 10%.
Summarize the number of applicants in a range of time, given the records #
=COUNTIF('raw data'!G:G,A2) counts how many cells in column G of the 'raw data' sheet match the value in A2
Problem domain #
The specific area of analysis that encompasses every activity affecting or affected by the problem
Scope of work (SOW) #
An agreed-upon outline of the work you’re going to perform in a project
A statement of work #
is a document that clearly identifies the products and services a vendor or contractor will provide to an organization. It includes objectives, guidelines, deliverables, schedule, and costs.
Creating a scope of work #
Deliverables: What work is being done, and what things are being created as a result of this project? When the project is complete, what are you expected to deliver to the stakeholders? Be specific here. Will you collect data for this project? How much, or for how long?
Milestones: This is closely related to your timeline. What are the major milestones for progress in your project? How do you know when a given part of the project is considered complete?
Timeline: Your timeline will be closely tied to the milestones you create for your project. The timeline is a way of mapping expectations for how long each step of the process should take. The timeline should be specific enough to help all involved decide if a project is on schedule. When will the deliverables be completed? How long do you expect the project will take to complete? If all goes as planned, how long do you expect each component of the project will take? When can we expect to reach each milestone?
Reports: Good SOWs also set boundaries for how and when you’ll give status updates to stakeholders. How will you communicate progress with stakeholders and sponsors, and how often? Will progress be reported weekly? Monthly? When milestones are completed? What information will status reports contain?
The importance of context #
Who: The person or organization that created, collected, and/or funded the data collection
What: The things in the world that data could have an impact on
Where: The origin of the data
When: The time when the data was created or collected
Why: The motivation behind the creation or collection
How: The method used to create or collect it
Turnover rate: The rate at which employees leave a company
There are three common stakeholder groups that you might find yourself working with: the executive team, the customer-facing team, and the data science team.
Priority #
Who are the primary and secondary stakeholders? Probably the vice president of HR
Who is managing the data?
Where can you go for help?
Vice president of sales #
The VP of sales provides strategic and operational direction but is less interested in specific details. Ning prepares questions ahead of time to focus on the key findings that the company expects from an annual sales report.
Before you communicate, think about #
- Who your audience is
- What they already know
- What they need to know
- How you can communicate that effectively to them
Sometimes you have to know what they really want to know #
- Reframe the question
- Problems
- Challenges
- Solutions
- Timelines
Redirecting the conversation will help you find the real problem which leads to more insightful and accurate solutions.
Recommendations from Avinash Kaushik #
Compare the same types of data: Data can get mixed up when you chart it for visualization. Be sure to compare the same types of data and double check that any segments in your chart definitely display different metrics.
Visualize with care: A 0.01% drop in a score can look huge if you zoom in close enough. To make sure your audience sees the full story clearly, it is a good idea to set your Y-axis to 0.
Leave out needless graphs: If a table can show your story at a glance, stick with the table instead of a pie chart or a graph. Your busy audience will appreciate the clarity.
Test for statistical significance: Sometimes two datasets will look different, but you will need a way to test whether the difference is real and important. So remember to run statistical tests to see how much confidence you can place in that difference.
Pay attention to sample size: Gather lots of data. If a sample size is small, a few unusual responses can skew the results. If you find that you have too little data, be careful about using it to form judgments. Look for opportunities to collect more data, then chart those trends over longer periods.
Are there other angles you haven’t considered? Can you answer any questions that may get asked about your data and analysis? That last question brings up something else to think about. How detailed should you be when sharing your results?
Meetings #
- Discuss how the project is going
- Ask questions
- Bring what you need
- Read the meeting agenda beforehand
- Prepare notes
- Work toward a clear decision
If there’s any conflict in your team or project:
- Try to reframe the problem by asking: how can I help you reach your goal?
- Ask questions like: Are there other important things I should be considering?
- Understand the context; ask questions like: What is your end goal? What story are they trying to tell with the data? What is the big picture?
- Instead of saying, “There’s no way I can do that in this time frame,” try to reframe it by saying, “I would be happy to do that, but it will take this amount of time. Let’s take a step back so I can better understand what you’d like to do with the data, and we can work together to find the best path forward.”
Turnover rate: The rate at which employees voluntarily leave a company
Prepare data for exploration #
How data is collected #
- Interviews
- Observations
- Forms
- Questionnaires
- Surveys
- Cookies
Data collection considerations #
- How the data will be collected
- Choose data sources
- Decide what data to use
- How much data to collect
- Select the right data type
- Determine the time frame
First-party data: Data collected by an individual or group using their own resources
Second-party data: Data collected by a group directly from its audience and then sold
Third-party data: Data collected from outside sources who did not collect it directly
Following are some data-collection considerations to keep in mind for your analysis:
How the data will be collected #
Decide if you will collect the data using your own resources or receive (and possibly purchase it) from another party. Data that you collect yourself is called first-party data.
Data sources #
If you don’t collect the data using your own resources, you might get data from second-party or third-party data providers. Second-party data is collected directly by another group and then sold. Third-party data is sold by a provider that didn’t collect the data themselves. Third-party data might come from a number of different sources.
Solving your business problem #
Datasets can show a lot of interesting information. But be sure to choose data that can actually help solve your problem question. For example, if you are analyzing trends over time, make sure you use time series data — in other words, data that includes dates.
How much data to collect #
If you are collecting your own data, make reasonable decisions about sample size. A random sample from existing data might be fine for some projects. Other projects might need more strategic data collection to focus on certain criteria. Each project has its own needs.
Time frame #
If you are collecting your own data, decide how long you will need to collect it, especially if you are tracking trends over a long period of time. If you need an immediate answer, you might not have time to collect new data. In this case, you would need to use historical data that already exists.
Discrete data #
Data that is counted and has a limited number of values. Discrete like Discrete Maths
Continuous data #
Data that is measured and can have almost any numeric value
Nominal data #
A type of qualitative data that is categorized without a set order. “Yes”/“No”/“Idk”
Ordinal data #
A type of qualitative data with a set order or scale. If you asked a group of people to rank a movie from 1 to 5, some might rank it as a 2, others a 4, and so on. These rankings are in order of how much each person liked the movie.
Internal data #
Data that lives within a company’s own systems
External data #
Data that lives and is generated outside of an organization
Structured data #
Data organized in a certain format such as rows and columns
Unstructured data #
Data that is not organized in any easily identifiable manner. Like audio, and video files
Primary versus secondary data #
The following highlights the differences between primary and secondary data, with examples of each.
Primary data: Collected by a researcher from first-hand sources
- Data from an interview you conducted
- Data from a survey returned from 20 participants
- Data from questionnaires you got back from a group of workers
Secondary data: Gathered by other people or from other research
- Data you bought from a local data analytics firm’s customer profiles
- Demographic data collected by a university
- Census data gathered by the federal government
Internal versus external data #
The following highlights the differences between internal and external data, with examples of each.
Internal data: Data that is stored inside a company’s own systems
- Wages of employees across different business units tracked by HR
- Sales data by store location
- Product inventory levels across distribution centers
External data: Data that is stored outside of a company or organization
- National average wages for the various positions throughout your organization
- Credit reports for customers of an auto dealership
Continuous versus discrete data #
The following highlights the differences between continuous and discrete data, with examples of each.
Continuous data: Data that is measured and can have almost any numeric value
- Height of kids in third grade classes (52.5 inches, 65.7 inches)
- Runtime markers in a video
- Temperature
Discrete data: Data that is counted and has a limited number of values
- Number of people who visit a hospital on a daily basis (10, 20, 200)
- Maximum capacity allowed in a room
- Tickets sold in the current month
Qualitative versus quantitative data #
The following highlights the differences between qualitative and quantitative data, with examples of each.
Qualitative: A subjective and explanatory measure of a quality or characteristic
- Favorite exercise activity
- Brand with best customer service
- Fashion preferences of young adults
Quantitative: A specific and objective measure, such as a number, quantity, or range
- Percentage of board certified doctors who are women
- Population size of elephants in Africa
- Distance from Earth to Mars at a particular time
Nominal versus ordinal data #
The following highlights the differences between nominal and ordinal data, with examples of each.
Nominal: A type of qualitative data that is categorized without a set order
- First time customer, returning customer, regular customer
- New job applicant, existing applicant, internal applicant
- New listing, reduced price listing, foreclosure
Ordinal: A type of qualitative data with a set order or scale
- Movie ratings (number of stars: 1 star, 2 stars, 3 stars)
- Ranked-choice voting selections (1st, 2nd, 3rd)
- Satisfaction level measured in a survey (satisfied, neutral, dissatisfied)
Structured versus unstructured data #
The following highlights the differences between structured and unstructured data, with examples of each.
Structured data: Data organized in a certain format, like rows and columns
- Expense reports
- Tax returns
- Store inventory
Unstructured data: Data that cannot be stored as columns and rows in a relational database
- Social media posts
- Emails
- Videos
Data model #
A model that is used for organizing data elements and how they relate to one another.
Sources of structured data #
- Spreadsheets
- Databases that store datasets
Data modeling is the process of creating diagrams that visually represent how data is organized and structured. These visual representations are called data models. A data model is used to organize data elements and how they relate to one another. Data elements are pieces of information, such as people’s names, account numbers, and addresses.
Conceptual data modeling gives a high-level view of the data structure, such as how data interacts across an organization. For example, a conceptual data model may be used to define the business requirements for a new database. A conceptual data model doesn’t contain technical details.
Logical data modeling focuses on the technical details of a database such as relationships, attributes, and entities. For example, a logical data model defines how individual records are uniquely identified in a database. But it doesn’t spell out actual names of database tables. That’s the job of a physical data model.
Physical data modeling depicts how a database operates. A physical data model defines all entities and attributes used; for example, it includes table names, column names, and data types for the database.
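For instance, the details a physical data model pins down can be sketched as a table definition; this employees table and its columns are hypothetical:

-- Table name, column names, and data types are exactly the
-- details a physical data model specifies
CREATE TABLE employees (
    employee_id INT PRIMARY KEY,
    first_name  VARCHAR(50),
    hire_date   DATE,
    salary      DECIMAL(10, 2)
);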
Entity Relationship Diagram (ERD) and the Unified Modeling Language (UML) diagram. ERDs are a visual way to understand the relationship between entities in the data model. UML diagrams are very detailed diagrams that describe the structure of a system by showing the system’s entities, attributes, operations, and their relationships.
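As a rough text stand-in for an ERD (entities and fields invented for illustration): Customer (customer_id, name) has a one-to-many relationship with Order (order_id, customer_id, order_date). One customer can place many orders, and the shared customer_id field expresses that relationship.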
Data type #
A specific kind of data attribute that tells what kind of value the data is.
Data types in spreadsheets #
- Number
- Text or string
- Boolean
Wide data #
Data in which every data subject has a single row with multiple columns to hold the values of various attributes of the subject. Easier to compare.
Long data #
Data in which each row is one time point per subject, so each subject will have data in multiple rows.
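A tiny invented example of the same interest-rate data in both shapes:

Wide (one row per bank, one column per year):
bank, 2021, 2022, 2023
Bank A, 1.0, 1.5, 2.0

Long (one row per bank per year):
bank, year, rate
Bank A, 2021, 1.0
Bank A, 2022, 1.5
Bank A, 2023, 2.0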
Data -> Collection of facts.
Data transformation usually involves:
- Adding, copying, or replicating data
- Deleting fields or records
- Standardizing the names of variables
- Renaming, moving, or combining columns in a database
- Joining one set of data with another
- Saving a file in a different format. For example, saving a spreadsheet as a comma separated values (.csv) file.
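As a hedged SQL sketch of two of these transformations, renaming a column and joining one set of data with another (all table and column names hypothetical):

SELECT
    o.order_id,
    o.order_total AS revenue,  -- renaming (aliasing) a column
    c.customer_name
FROM orders AS o
JOIN customers AS c            -- joining one set of data with another
    ON o.customer_id = c.customer_id;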
Goals for data transformation might be:
- Data organization: better organized data is easier to use
- Data compatibility: different applications or systems can then use the same data
- Data migration: data with matching formats can be moved from one system to another
- Data merging: data with the same organization can be merged together
- Data enhancement: data can be displayed with more detailed fields
- Data comparison: apples-to-apples comparisons of the data can then be made (a short R sketch of a few common transformations follows this list)
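A minimal sketch in R of a few of the transformations above: renaming a variable, combining columns, and saving to a different format. The table and its values are made up for illustration:

```r
# Hypothetical customer data
customers <- data.frame(
  first = c("Ada", "Grace"),
  last  = c("Lovelace", "Hopper"),
  spend = c(120.5, 98.0)
)

# Standardize a variable name
names(customers)[names(customers) == "spend"] <- "total_spend_usd"

# Combine two columns into one
customers$full_name <- paste(customers$first, customers$last)

# Save the file in a different format (.csv) for compatibility
write.csv(customers, "customers.csv", row.names = FALSE)
```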
Wide data is preferred when:
- Creating tables and charts with a few variables about each subject
- Comparing straightforward line graphs

Long data is preferred when (see the reshaping sketch after this list):
- Storing many variables about each subject, for example, 60 years' worth of interest rates for each bank
- Performing advanced statistical analysis or graphing
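As referenced above, a minimal sketch of reshaping wide data into long data with the tidyr package (the interest-rate table and its column names are hypothetical):

```r
library(tidyr)

# Wide: one row per bank, one column per year
wide <- data.frame(
  bank      = c("Bank A", "Bank B"),
  rate_2021 = c(0.5, 0.7),
  rate_2022 = c(1.5, 1.8)
)

# Long: one row per bank per year (one time point per subject)
long <- pivot_longer(
  wide,
  cols         = starts_with("rate_"),
  names_to     = "year",
  names_prefix = "rate_",
  values_to    = "interest_rate"
)
```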
Field: A single piece of information from a row or column of a spreadsheet; in a data table, typically a column in the table.
Bias #
A preference in favor of or against a person, group of people, or thing.
Data bias #
A type of error that systematically skews results in a certain direction.
Sampling bias #
When a sample isn’t representative of the population as a whole.
Unbiased sampling #
When a sample is representative of the population being measured
Observer bias (Experimenter bias/ research bias) #
The tendency for different people to observe things differently.
Interpretation bias #
The tendency to interpret ambiguous situations in a positive or negative way.
Confirmation bias #
The tendency to search for or interpret information in a way that confirms pre-existing beliefs.
Types of data bias #
- Sampling bias
- Observer bias
- Interpretation bias
- Confirmation bias
Good data sources ( ROCCC) #
- Reliable: Good data sources can be trusted to provide accurate information.
- Original: Validate the data with the original source.
- Comprehensive: The best data sources contain all the critical information needed to answer the question or find the solution.
- Current: The best data sources are current and relevant to the task at hand.
- Cited: Citing the source makes the information you're providing more credible.
Bad data sources ( ROCCC) #
- Not reliable: Bad data can't be trusted because it's inaccurate, incomplete, or biased.
- Not original: If you can't locate the original data source and are relying only on second- or third-party information, be extra careful in understanding your data.
- Not comprehensive: Bad data sources are missing important information needed to answer the question or find the solution.
- Not current: Bad data sources are out of date and irrelevant.
- Not cited: If your source hasn't been cited or vetted, it's a no-go.
Ethics #
Well-founded standards of right and wrong that prescribe what humans ought to do, usually in terms of rights, obligations, benefits to society, fairness, or specific virtues.
Data ethics #
Well-founded standards of right and wrong that dictate how data is collected, shared, and used.
GDPR #
General Data Protection Regulation of the European Union.
Aspects of data ethics #
- Ownership
- Transaction transparency
- Consent
- Currency
- Privacy
- Openness
Ownership #
Individuals own the raw data they provide and they have primary control over its usage, how it’s processed, and how it’s shared.
Transaction transparency #
All data-processing activities and algorithms should be completely explainable and understood by the individual who provides their data.
Consent #
An individual’s right to know explicit details about how and why their data will be used before agreeing to provide it.
Currency #
Individuals should be aware of financial transactions resulting from the use of their personal data and the scale of these transactions.
Privacy #
Preserving a data subject’s information and activity any time a data transaction occurs.
- Protection from unauthorized access to our private data
- Freedom from inappropriate use of our data
- The right to inspect, update, or correct our data
- The ability to give consent to the use of our data
- The legal right to access our data
Openness #
Free access, usage, and sharing of data.
- Availability and access: Open data must be available as a whole, preferably by downloading over the internet, in a convenient and modifiable form.
- Reuse and redistribution: Open data must be provided under terms that allow reuse and redistribution, including combining it with other datasets.
- Universal participation: Everyone must be able to use, reuse, and redistribute the data.
Data interoperability #
The ability of data systems and services to openly connect and share data
“Self-reflect and understand what it is that you’re doing and the impact that it has.”
Terms and definitions for Course 3, Module 2 #
Bad data source: A data source that is not reliable, original, comprehensive, current, and cited (ROCCC)
Bias: A conscious or subconscious preference in favor of or against a person, group of people, or thing
Confirmation bias: The tendency to search for or interpret information in a way that confirms pre-existing beliefs
Consent: The aspect of data ethics that presumes an individual’s right to know how and why their personal data will be used before agreeing to provide it
Cookie: A small file stored on a computer that contains information about its users
Currency: The aspect of data ethics that presumes individuals should be aware of financial transactions resulting from the use of their personal data and the scale of those transactions
Data anonymization: The process of protecting people’s private or sensitive data by eliminating identifying information
Data bias: When a preference in favor of or against a person, group of people, or thing systematically skews data analysis results in a certain direction
Data ethics: Well-founded standards of right and wrong that dictate how data is collected, shared, and used
Data interoperability: A key factor leading to the successful use of open data among companies and governments
Data privacy: Preserving a data subject’s information any time a data transaction occurs
Ethics: Well-founded standards of right and wrong that prescribe what humans ought to do, usually in terms of rights, obligations, benefits to society, fairness, or specific virtues
Experimenter bias: The tendency for different people to observe things differently (also called observer bias)
Fairness: A quality of data analysis that does not create or reinforce bias
First-party data: Data collected by an individual or group using their own resources
General Data Protection Regulation of the European Union (GDPR): A regulation created in the European Union to help protect people and their data
Good data source: A data source that is reliable, original, comprehensive, current, and cited (ROCCC)
Interpretation bias: The tendency to interpret ambiguous situations in a positive or negative way
Observer bias: The tendency for different people to observe things differently (also called experimenter bias)
Open data: Data that is available to the public
Openness: The aspect of data ethics that promotes the free access, usage, and sharing of data
Sampling bias: Overrepresenting or underrepresenting certain members of a population as a result of working with a sample that is not representative of the population as a whole
Transaction transparency: The aspect of data ethics that presumes all data-processing activities and algorithms should be explainable and understood by the individual who provides the data
Unbiased sampling: When the sample of the population being measured is representative of the population as a whole
Database #
Metadata #
Data about data
Relational database #
A database that contains a series of related tables that can be connected via their relationships.
Primary key: An identifier that references a column in which each value is unique.
Foreign key: A field within a table that is a primary key in another table.
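A toy illustration of how primary and foreign keys connect related tables, using base R's merge() to mimic a relational join (both tables and their columns are hypothetical):

```r
customers <- data.frame(
  customer_id = c(1, 2, 3),              # primary key: unique per customer
  name        = c("Aisha", "Ben", "Chen")
)

orders <- data.frame(
  order_id    = c(101, 102, 103),        # primary key of the orders table
  customer_id = c(1, 1, 3),              # foreign key referencing customers
  amount      = c(25.0, 40.0, 15.5)
)

# Connect the tables via their relationship, much like a SQL JOIN
merge(orders, customers, by = "customer_id")
```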
Metadata is used in database management to help data analysts interpret the contents of the data within the database.
3 common types of metadata:
- Descriptive: Metadata that describes a piece of data and can be used to identify it at a later point in time (ISBN in a book)
- Structural: Metadata that indicates how a piece of data is organized and whether it is part of one, or more than one, data collection. (Like the index in a book)
- Administrative: Metadata that indicates the technical source of a digital asset (like the file type of a photo)
Metadata helps data analysts confirm their data is reliable by making sure it is:
- Accurate
- Precise
- Relevant
- Timely
Consistency #
When a database is consistent, it’s easier to discover relationships between the data inside the database and data that exists elsewhere. When data is uniform, it is:
Organized: Data analysts can easily find tables and files, monitor the creation and alteration of assets, and store metadata.
Classified: Data analysts can categorize data when it follows a consistent format, which is beneficial in cleaning and processing data.
Stored: Consistent and uniform data can be efficiently stored in various data repositories. This streamlines storage management tasks such as managing a database.
Accessed: Users, applications, and systems can efficiently locate and use data.
Metadata repositories are used to store metadata—including data from second-party and third-party companies. These repositories describe the state and location of the metadata, the structure of the tables inside it, and who has accessed the repository. Data analysts use metadata repositories to ensure that they use the right data appropriately.
Metadata is stored in a single, central location, and gives the company standardized information about all of its data.
Data governance: A process to ensure the formal management of a company's data assets
Internal data: Data that lives within a company's own systems
External data: Data that lives and is generated outside an organization
Openness (open data): Free access, usage, and sharing of data
CSV (comma-separated values): A file format that saves data in a table format
Sorting data: Arranging data into a meaningful order to make it easier to understand, analyze, and visualize
Filtering: Showing only the data that meets specific criteria while hiding the rest
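A small sketch of sorting and filtering a data frame in base R (the sales table is made up):

```r
sales <- data.frame(
  region = c("North", "South", "North", "East"),
  amount = c(200, 150, 320, 90)
)

sorted   <- sales[order(-sales$amount), ]     # arrange by amount, descending
filtered <- subset(sales, region == "North")  # show only rows meeting the criteria
```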
Data Manipulation Language (DML): The SQL commands, such as INSERT, UPDATE, and DELETE, used to add, change, or remove data in a database
Terms and definitions for Course 3, Module 3 #
Administrative metadata: Metadata that indicates the technical source of a digital asset
CSV (comma-separated values) file: A delimited text file that uses a comma to separate values
Data governance: A process for ensuring the formal management of a company’s data assets
Descriptive metadata: Metadata that describes a piece of data and can be used to identify it at a later point in time
Foreign key: A field within a database table that is a primary key in another table (Refer to primary key)
FROM: The section of a query that indicates where the selected data comes from
Geolocation: The geographical location of a person or device by means of digital information
Metadata: Data about data
Metadata repository: A database created to store metadata
Naming conventions: Consistent guidelines that describe the content, creation date, and version of a file in its name
Normalized database: A database in which only related data is stored in each table
Notebook: An interactive, editable programming environment for creating data reports and showcasing data skills
Primary key: An identifier in a database that references a column in which each value is unique (Refer to foreign key)
Redundancy: When the same piece of data is stored in two or more places
Schema: A way of describing how something, such as data, is organized
SELECT: The section of a query that indicates the subset of a dataset
Structural metadata: Metadata that indicates how a piece of data is organized and whether it is part of one or more than one data collection
WHERE: The section of a query that specifies criteria that the requested data must meet
World Health Organization: An organization whose primary role is to direct and coordinate international health within the United Nations system
Benefits of organizing data #
- Makes it easier to find and use
- Helps you avoid making mistakes during your analysis
- Helps to protect your data
Best practices when organizing data #
- Naming conventions: Consistent guidelines that describe the content, date, or version of a file in its name (Use logical and descriptive names for your files to make them easier to find and use)
- Foldering: Organize your files into folders and subfolders to keep related project files together
- Archiving older files: Move old projects to a separate location to create an archive and cut down on clutter
- Align your naming and storage practices with your team
- Develop metadata practices
Think about how often you're making copies of data and storing it in different places.
File names should include:
- The project's name
- The file creation date
- The revision version
- A consistent style and order
Data security #
Protecting data from unauthorized access or corruption by adopting safety measures.
Encryption uses a unique algorithm to alter data and make it unusable by users and applications that don’t know the algorithm. This algorithm is saved as a “key” which can be used to reverse the encryption; so if you have the key, you can still use the data in its original form.
Tokenization replaces the data elements you want to protect with randomly generated data referred to as a “token.” The original data is stored in a separate location and mapped to the tokens. To access the complete original data, the user or application needs to have permission to use the tokenized data and the token mapping. This means that even if the tokenized data is hacked, the original data is still safe and secure in a separate location.
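A toy sketch of the tokenization idea, for illustration only, not production security: sensitive values are swapped for random tokens, and the mapping is stored separately.

```r
set.seed(1)  # for reproducibility of this example
emails <- c("a@example.com", "b@example.com")

# Generate one random 12-character token per value
tokens <- replicate(
  length(emails),
  paste0(sample(c(letters, 0:9), 12, replace = TRUE), collapse = "")
)

token_map      <- data.frame(token = tokens, original = emails)  # kept in a separate, secured location
tokenized_data <- tokens  # what downstream users and applications see

# Only someone with access to token_map can recover the original values
```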
Access control: Features such as password protection, user permissions, and encryption that are used to protect a spreadsheet
Data security: Protecting data from unauthorized access or corruption by adopting safety measures
Inbox: Electronic storage where emails received by an individual are held
Networking #
Professional relationship building
Data integrity #
The accuracy, completeness, consistency, and trustworthiness of data throughout its lifecycle.
Data replication #
The process of storing data in multiple locations.
Data transfer #
The process of copying data from a storage device to memory, or from one computer to another.
Data manipulation #
The process of changing data to make it more organized and easier to read.
Data replication compromising data integrity: Continuing with the example, imagine you ask your international counterparts to verify dates and stick to one format. One analyst copies a large dataset to check the dates. But because of memory issues, only part of the dataset is actually copied. The analyst would be verifying and standardizing incomplete data. That partial dataset would be certified as compliant but the full dataset would still contain dates that weren’t verified. Two versions of a dataset can introduce inconsistent results. A final audit of results would be essential to reveal what happened and correct all dates.
Data transfer compromising data integrity: Another analyst checks the dates in a spreadsheet and chooses to import the validated and standardized data back to the database. But suppose the date field from the spreadsheet was incorrectly classified as a text field during the data import (transfer) process. Now some of the dates in the database are stored as text strings. At this point, the data needs to be cleaned to restore its integrity.
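In that scenario, the cleanup step amounts to converting the text strings back into a proper date type; a minimal sketch in R (the format string is assumed for illustration):

```r
dates_as_text <- c("2023-01-15", "2023-02-20")  # stored as text after the bad import
dates <- as.Date(dates_as_text, format = "%Y-%m-%d")
class(dates)  # "Date": the field behaves as a date again
```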
Data manipulation compromising data integrity: When checking dates, another analyst notices what appears to be a duplicate record in the database and removes it. But it turns out that the analyst removed a unique record for a company’s subsidiary and not a duplicate record for the company. Your dataset is now missing data and the data must be restored for completeness.
The most common limitations you'll come across, and some ways you can address them:
- Identify trends with the available data, or wait for more data if time allows
- Talk with stakeholders and adjust your objective
- Look for a new dataset
Data issue 1: no data #
| Possible solutions | Examples of solutions in real life |
| --- | --- |
| Gather the data on a small scale to perform a preliminary analysis, and then request additional time to complete the analysis after you have collected more data. | If you are surveying employees about what they think about a new performance and bonus plan, use a sample for a preliminary analysis. Then ask for another three weeks to collect the data from all employees. |
| If there isn't time to collect data, perform the analysis using proxy data from other datasets. This is the most common workaround. | If you are analyzing peak travel times for commuters but don't have the data for a particular city, use the data from another city with a similar size and demographic. |
Data issue 2: too little data #
| Possible solutions | Examples of solutions in real life |
| --- | --- |
| Do the analysis using proxy data along with actual data. | If you are analyzing trends for owners of golden retrievers, make your dataset larger by including the data from owners of labradors. |
| Adjust your analysis to align with the data you already have. | If you are missing data for 18- to 24-year-olds, do the analysis but note the following limitation in your report: this conclusion applies to adults 25 years and older only. |
Data issue 3: wrong data, including data with errors* #
| Possible solutions | Examples of solutions in real life |
| --- | --- |
| If you have the wrong data because requirements were misunderstood, communicate the requirements again. | If you need the data for female voters and received the data for male voters, restate your needs. |
| Identify errors in the data and, if possible, correct them at the source by looking for a pattern in the errors. | If your data is in a spreadsheet and there is a conditional statement or boolean causing calculations to be wrong, change the conditional statement instead of just fixing the calculated values. |
| If you can't correct data errors yourself, you can ignore the wrong data and go ahead with the analysis if your sample size is still large enough and ignoring the data won't cause systematic bias. | If your dataset was translated from a different language and some of the translations don't make sense, ignore the data with bad translation and go ahead with the analysis of the other data. |
**Important note:** Sometimes data with errors can be a warning sign that the data isn't reliable. Use your best judgment.
Sample size #
A part of a population that is representative of the population
Random sampling #
A way of selecting a sample from a population so that every possible type of sample has an equal chance of being chosen.
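A minimal sketch of drawing a simple random sample in base R (the population size and sample size are made up):

```r
set.seed(42)  # for reproducibility
employee_ids  <- 1:5000                            # the whole population
survey_sample <- sample(employee_ids, size = 100)  # each ID equally likely to be chosen
```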
Sample size #
Population
The entire group that you are interested in for your study. For example, if you are surveying people in your company, the population would be all the employees in your company.
Sample
A subset of your population. Just like a food sample, it is called a sample because it is only a taste. So if your company is too large to survey every individual, you can survey a representative sample of your population.
Margin of error
Since a sample is used to represent a population, the sample’s results are expected to differ from what the result would have been if you had surveyed the entire population. This difference is called the margin of error. The smaller the margin of error, the closer the results of the sample are to what the result would have been if you had surveyed the entire population.
Confidence level
How confident you are in the survey results. For example, a 95% confidence level means that if you were to run the same survey 100 times, you would get similar results 95 of those 100 times. Confidence level is targeted before you start your study because it will affect how big your margin of error is at the end of your study.
Confidence interval
The range of possible values that the population’s result would be at the confidence level of the study. This range is the sample result +/- the margin of error.
Statistical significance
The determination of whether your result could be due to random chance or not. The greater the significance, the less due to chance.
Complete the following tasks before analyzing data: #
1. Review data integrity
- Determine data integrity by assessing the overall accuracy, consistency, and completeness of the data.
- Connect objectives to data by understanding how your business objectives can be served by an investigation into the data.
- Know when to stop collecting data.
2. Identify what makes data insufficient. Insufficient data has one or more of the following problems:
- Comes from only one source
- Continuously updates and is incomplete
- Is outdated
- Is geographically limited
3. Deal with insufficient data. To deal with insufficient data, you can:
- Identify trends within the available data.
- Wait for more data if time allows.
- Discuss with stakeholders and adjust your objective.
- Search for a new dataset.
Statistical power #
The probability of getting meaningful results from a test.
Hypothesis testing #
A way to see if a survey or experiment has meaningful results.
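One way to reason about statistical power before running a study is base R's power.prop.test(); this sketch asks how many subjects per group are needed to detect a made-up effect:

```r
# Sample size per group needed to detect an increase in a response rate
# from 10% to 12%, with 80% power at a 5% significance level
power.prop.test(p1 = 0.10, p2 = 0.12, power = 0.80, sig.level = 0.05)
```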
Confidence level #
The probability that your sample size accurately reflects the greater population.
Estimated response rate: If you are running a survey of individuals, this is the percentage of people you expect will complete your survey out of those who received the survey.
Margin of error #
The maximum amount that the sample results are expected to differ from those of the actual population. The closer to zero the margin of error, the closer the results from your sample match the results from the overall population.
More technically, the margin of error defines a range of values below and above the average result for the sample. The average result for the entire population is expected to be within that range. We can better understand margin of error by using some examples below.
To calculate margin of error, you need the following (a worked sketch follows this list): #
- Population size
- Sample size
- Confidence level
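The worked sketch referenced above uses a common textbook formula for the margin of error of a proportion with a finite population correction; survey tools may use slightly different variants, and the inputs here are made up:

```r
margin_of_error <- function(population_size, sample_size,
                            confidence_level = 0.95, p = 0.5) {
  z   <- qnorm(1 - (1 - confidence_level) / 2)  # z-score, e.g. ~1.96 for 95%
  se  <- sqrt(p * (1 - p) / sample_size)        # standard error, worst case at p = 0.5
  fpc <- sqrt((population_size - sample_size) / (population_size - 1))
  z * se * fpc
}

margin_of_error(population_size = 10000, sample_size = 500)
# ~0.043, i.e. roughly +/- 4.3 percentage points
```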
Accuracy: The degree to which the data conforms to the actual entity being measured or described
Completeness: The degree to which the data contains all desired components or measures
Confidence interval: A range of values that conveys how likely a statistical estimate reflects the population
Confidence level: The probability that a sample size accurately reflects the greater population
Consistency: The degree to which data is repeatable from different points of entry or collection
Cross-field validation: A process that ensures certain conditions for multiple data fields are satisfied
Data constraints: The criteria that determine whether a piece of data is clean and valid
Data integrity: The accuracy, completeness, consistency, and trustworthiness of data throughout its life cycle
Data manipulation: The process of changing data to make it more organized and easier to read
Data range: Numerical values that fall between predefined maximum and minimum values
Data replication: The process of storing data in multiple locations
DATEDIF: A spreadsheet function that calculates the number of days, months, or years between two dates
Estimated response rate: The average number of people who typically complete a survey
Hypothesis testing: A process to determine if a survey or experiment has meaningful results
Mandatory: A data value that cannot be left blank or empty
Margin of error: The maximum amount that the sample results are expected to differ from those of the actual population
Random sampling: A way of selecting a sample from a population so that every possible type of the sample has an equal chance of being chosen
Regular expression (RegEx): A rule that says the values in a table must match a prescribed pattern
Data engineers #
Transform data into a useful format for analysis and give it a reliable infrastructure.
Data warehousing specialists #
Develop processes and procedures to effectively store and organize data.
Duplicate data #
| Description | Possible causes | Potential harm to businesses |
| --- | --- | --- |
| Any data record that shows up more than once | Manual data entry, batch data imports, or data migration | Skewed metrics or analyses; inflated or inaccurate counts or predictions; confusion during data retrieval |
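A quick sketch of detecting and removing duplicate records in base R (the orders table is hypothetical):

```r
orders <- data.frame(
  order_id = c(101, 102, 102, 103),
  amount   = c(25, 40, 40, 15)
)

sum(duplicated(orders))                       # how many exact duplicate rows exist
orders_clean <- orders[!duplicated(orders), ] # keep only the first occurrence of each row
```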
Outdated data #
| Description | Possible causes | Potential harm to businesses |
| --- | --- | --- |
| Any data that is old and should be replaced with newer, more accurate information | People changing roles or companies, or software and systems becoming obsolete | Inaccurate insights, decision-making, and analytics |
Incomplete data #
| Description | Possible causes | Potential harm to businesses |
| --- | --- | --- |
| Any data that is missing important fields | Improper data collection or incorrect data entry | Decreased productivity, inaccurate insights, or inability to complete essential services |
Incorrect/inaccurate data #
| Description | Possible causes | Potential harm to businesses |
| --- | --- | --- |
| Any data that is complete but inaccurate | Human error during data input, fake information, or mock data | Inaccurate insights or decision-making based on bad information, resulting in revenue loss |
Inconsistent data #
| Description | Possible causes | Potential harm to businesses |
| --- | --- | --- |
| Any data that uses different formats to represent the same thing | Data stored incorrectly or errors inserted during data transfer | Contradictory data points leading to confusion or inability to classify or segment customers |