Saturday, 29 February 2020

How to set up a Python data science project?

     

           In data science, Python 3 is a powerful computational tool for working with data. It is used in small and large organizations alike for tasks that range from data analysis integrated with web applications to statistical code incorporated into production databases. Python emphasizes productivity and code readability. To write programs you need a coding environment. Jupyter Notebook is a computational platform that lets you write code and apply your data science skills; you can install it from the command line with "pip install jupyterlab" and launch it with "jupyter lab" (the classic interface is available via "pip install notebook" and "jupyter notebook"). The R language is used for data analysis tasks on standalone machines or individual servers; it focuses on user-friendly data analysis, statistics and graphical models. Datasets are arrangements of data and can be structured in different ways, and data files are stored in specific formats. The most common file format for storing data is comma-separated values (CSV), where each record is stored as one line and the fields are separated by commas. The pandas library in Python has a read_csv function that quickly reads such a file into memory. JSON is another common format, and it is the sort of data that gets exchanged in data science web applications.
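As a quick sketch of the formats mentioned above, the snippet below reads a small CSV table with pandas and round-trips it to JSON. The column names and values are made up for illustration; in practice you would pass a file path to read_csv.

```python
import io
import pandas as pd

# A small inline CSV sample; in practice you would pass a file path
# such as "measurements.csv" (a hypothetical name) to read_csv.
csv_text = "city,temp_c\nChennai,31.5\nOslo,4.2\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)              # two records, two fields
print(df["temp_c"].mean())   # pandas gives summary statistics directly

# The same table can be exchanged as JSON, e.g. with a web application.
json_text = df.to_json(orient="records")
print(json_text)
```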

Big Data and Data Science: Big data isn't just bigger data or more of it. It is a qualitatively different approach to solving problems, one that requires qualitatively different methods; that is both the challenge and the promise of big data. Big data differs from small data in the following ways:
1. Goals: For small data, there is a specific, singular goal; you are trying to accomplish one task by analyzing the data. Big data goals evolve and get redirected over time; you may have one goal at the starting point, but things can take unexpected directions.
2. Location: Small data is usually in one place: one computer file, perhaps one floppy disk. Big data can be spread across multiple servers in multiple locations anywhere on the internet.
3. Data Structure and Content: Small data is typically structured in a single table, like a spreadsheet. Big data can be semi-structured or unstructured and drawn from different sources.
4. Data Preparation: Small data is usually prepared by the end users for their own goals; they know who put it in, what it is meant to accomplish and why it is there. Big data is a team sport and may be prepared by many people who are not the end users, so the degree of coordination required is extraordinarily high.
5. Longevity: Small data is kept only for a limited time after the project is finished; it doesn't matter if it goes away after a few months or years. Big data is stored in perpetuity and becomes part of later projects, so a future project might add to the existing data, along with historical data and data from other sources. It evolves over time.
6. Measurements: Small data is typically measured in standardized units using one protocol, because one person is doing the measuring and it all happens at one point in time. Big data comes in many different formats, measured in many different units, gathered with different protocols by different people in different places at different times. There is no assumption of standardization or uniformity.
7. Reproducibility: Small-data projects can be reproduced; if the data goes bad or goes missing, you can do it over again. With big data, replication may not be possible or feasible. Bad data is identified by a forensic process, and you either attempt to repair things or do without them.
8. Stakes Involved: In small data, the risks are generally limited; if the project doesn't work, it is usually not catastrophic. In big data, the stakes are enormous because so much time and effort is invested: a project can cost hundreds of millions of dollars, and lost or bad data can doom it.
9. Introspection: This has to do with where the data comes from and how well the data describes itself. Small data is well organized, individual data points are easy to locate, and clear metadata records where they come from and what the values mean. Big data may consist of many different files, potentially in many different formats; it is difficult to locate the data points you are looking for, poorly documented values slip through the cracks, and it is hard to interpret exactly what each value means.
10. Analysis: With small data, you generally analyze all the data in one procedure on one machine. Big data may have to be broken apart, analyzed in several steps using different methods, and the results combined at the end.
          Data scientists follow a set of processes and stages called the data science life cycle. The first stage is to formulate a question about the problem you want to solve; the most important part of a data science project is the question itself and the creativity you bring to exploring it. Next, acquire data relevant to the question. When you collect sample data, you need to make sure there is as little bias as possible in how the sample is collected. There are various methods for collecting samples:
         1. Simple Random Sampling (SRS)
         2. Cluster Sampling
         3. Stratified Sampling
             Probability sampling is any sampling method that assigns a precise probability to the appearance of each possible sample. One such method is simple random sampling: a sample is drawn at random without replacement. For example, to draw a simple random sample of size two from a population of 6 people, write A to F on slips, place the slips in a lot, and draw 2 slips without looking; you will get any of the 15 possible samples with equal chance: AB, AC, AD, AE, AF, BC, BD, BE, BF, CD, CE, CF, DE, DF, EF. Another probability sampling method is cluster sampling: the data is divided into clusters, and simple random sampling is used to select whole clusters. In the previous example, you could form 3 clusters of 2 people each, AB, CD and EF, each with an equal chance of selection. Cluster sampling makes sample collection easier and is often used to conduct surveys, but the disadvantage is that there is more variation in the estimate, so you need to take larger samples. Stratified sampling divides the data into strata and produces one simple random sample per stratum. In the previous example, you could divide the population into two strata, stratum 1 as A, B, C and stratum 2 as D, E, F, then use SRS to select one person from each stratum, giving the possible samples AD, AE, AF, BD, BE, BF, CD, CE, CF. In stratified sampling, the strata do not have to be the same size.
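The three sampling methods above can be sketched with Python's standard random module, using the A–F population from the example:

```python
import itertools
import random

population = ["A", "B", "C", "D", "E", "F"]

# Simple random sampling: 2 people drawn without replacement.
srs = random.sample(population, 2)

# All 15 equally likely size-2 samples from the text's example.
all_pairs = list(itertools.combinations(population, 2))

# Cluster sampling: select one whole cluster of 2 at random.
clusters = [["A", "B"], ["C", "D"], ["E", "F"]]
cluster_sample = random.choice(clusters)

# Stratified sampling: one SRS of size 1 per stratum.
strata = [["A", "B", "C"], ["D", "E", "F"]]
stratified_sample = [random.choice(stratum) for stratum in strata]

print(srs, len(all_pairs), cluster_sample, stratified_sample)
```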
       In the third stage, you conduct exploratory data analysis (EDA) to understand the data you have. Here you visualize the data to find patterns, issues and anything else of note. Data visualization is an essential tool in data science; a rule of thumb for data science deliverables is that if there isn't a picture, you're doing it wrong. Machine learning practitioners need to know where information is hidden in the data and which directions are most promising to pursue, and visualizations convey trends and anomalies more efficiently than tables of numbers. Two important visualization libraries in Python are Matplotlib and Seaborn. They allow you to create two-dimensional and multi-dimensional plots of your data and help you visualize both qualitative and quantitative data.
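A minimal EDA sketch with pandas and Matplotlib (Seaborn offers higher-level versions of the same plots): summary statistics plus a scatter plot. The column names, values and output file name are made-up assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the script runs headless
import matplotlib.pyplot as plt
import pandas as pd

# Tiny made-up dataset for illustration.
df = pd.DataFrame({
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "score":         [52, 55, 61, 64, 70, 74],
})

# Quick EDA: summary statistics, then a scatter plot of the relationship.
print(df.describe())

fig, ax = plt.subplots()
ax.scatter(df["hours_studied"], df["score"])
ax.set_xlabel("hours studied")
ax.set_ylabel("score")
fig.savefig("eda_scatter.png")
```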
 
When you perform EDA, make sure to apply the following:
        * Avoid assumptions about the data
        * Examine the statistical data types in the data
        * Examine the key properties of the data
       This will help you find the answer to your question or the problem you want to solve. Finally, you use prediction and inference to draw conclusions from the data. Inference lets data scientists quantify how certain they are about the trends they see, and they use it to draw conclusions from the dataset. When you see a trend in the data, you need to determine whether it arises from random fluctuation in data collection or from a real phenomenon; hypothesis testing helps you answer this question.
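As a sketch of how hypothesis testing separates real phenomena from random fluctuation, the snippet below runs a simple permutation test on two made-up groups of scores: it shuffles the pooled data many times and checks how often a gap at least as large as the observed one appears by chance alone.

```python
import random

random.seed(0)  # fixed seed for a reproducible illustration

# Two hypothetical groups: does group B really score higher than A,
# or could the gap be random fluctuation?
a = [52, 55, 61, 58, 54, 57]
b = [63, 66, 60, 68, 64, 65]
observed_gap = sum(b) / len(b) - sum(a) / len(a)

# Permutation test: shuffle the pooled labels many times and count how
# often a chance split produces a gap at least this large.
pooled = a + b
count = 0
trials = 10_000
for _ in range(trials):
    random.shuffle(pooled)
    gap = sum(pooled[6:]) / 6 - sum(pooled[:6]) / 6
    if gap >= observed_gap:
        count += 1

p_value = count / trials
print(f"observed gap = {observed_gap:.2f}, p-value ~ {p_value:.4f}")
```

A small p-value means a gap this large rarely arises from shuffling alone, so the trend is unlikely to be mere fluctuation.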
      Classification is a machine learning technique for making categorical predictions from data. Once you have data whose correct categories are known, you learn from that data to make predictions in the future. For example, weather stations forecast tomorrow's weather from today's and previous days' weather, and classifiers are used to predict whether a patient has a particular disease. The situation in which you make a prediction is called an observation; each observation has certain aspects, called attributes, and the category an observation belongs to is called its class. The goal of classification is to correctly predict the class of an observation using its attributes. You then go through the process repeatedly for further questions and problems. These are the major stages of the data science process.
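A minimal sketch of classification in these terms: a 1-nearest-neighbour classifier over made-up weather observations. The attributes, labels and the choice of nearest-neighbour are illustrative assumptions, not a prescribed method.

```python
# Predict the class of a new observation from its attributes by finding
# the most similar labelled observation (1-nearest-neighbour).
def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def predict(attributes, training_data):
    # training_data: list of (attributes, class) pairs with known labels
    nearest = min(training_data, key=lambda pair: distance(pair[0], attributes))
    return nearest[1]

# Attributes: (temperature_c, humidity_pct); class: tomorrow's weather.
training = [
    ((30, 40), "sunny"),
    ((22, 85), "rainy"),
    ((25, 60), "cloudy"),
    ((31, 35), "sunny"),
    ((21, 90), "rainy"),
]

print(predict((29, 45), training))  # closest to the sunny observations
```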

Python 3 and Raspberry Pi:  The Raspberry Pi is a single-board computer (SBC): all the parts sit on a single printed circuit board (PCB) the size of a credit card, and the individual components are not replaceable or upgradeable. Normal desktop and laptop computers are not well suited to connecting sensors. The Raspberry Pi exposes 40 GPIO (general-purpose input/output) pins, depending on the model, and these pins are used to connect sensors and other devices. Using the GPIO pins, you can program hardware devices directly in high-level languages such as C, C++, shell scripting or Python, addressing the hardware in a way you cannot easily do on a PC. You can download the Raspberry Pi operating system from the official site; go with Raspbian and install the desktop-environment image based on Debian Buster. A download accelerator can help if the large image fails to download properly. Then follow a Raspbian OS setup guide ( bitwise.com/ssh-client-download ) and find the IP address of your Raspberry Pi from the command line. You can access the graphical user interface and desktop of the Raspberry Pi by enabling Virtual Network Computing (VNC): sign up with RealVNC and download the viewer for the operating system installed on your computer. Once you provide the default username and password (pi, raspberry), it will open the remote desktop, and you can start programming remotely. This is a very useful setup when you want to run small Python programs for training purposes, data visualization, data science and pursuing your own goals.
