Saturday, 29 February 2020

How to set up a Python data science project?


           In data science, Python 3 is a powerful computational tool for working with data. It is used in organizations both small and large, for tasks ranging from data analysis that must be integrated with web applications to statistical code that must be incorporated into a production database. Python emphasizes productivity and code readability. To write programs you need a coding environment. Jupyter Notebook is a computational platform that lets you write code and apply your data science skills. You can install it from the command line with "pip install notebook" and then type "jupyter notebook" to open a notebook in your browser; "pip install jupyterlab" followed by "jupyter lab" gives you the newer JupyterLab interface. The R language is used for data analysis tasks on standalone computers or individual servers. It focuses on user-friendly data analysis, statistics and graphical models. A dataset is an arrangement of data, and it can be structured in different ways. Data files are stored in a specific format. A common file format for storing data is comma-separated values (CSV), where each record is stored as a line and the fields are separated by commas. The pandas library in Python has a read_csv function that quickly reads such a file into memory. The JSON format is also common; it is the kind of data that gets exchanged in data science web applications.
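
As a minimal sketch of the two file formats mentioned above, the following uses only Python's standard csv and json modules on made-up in-memory data (the column names and values are hypothetical). If pandas is installed, pandas.read_csv would load the same CSV text into a DataFrame in a single call.

```python
import csv
import io
import json

# Hypothetical CSV content: one record per line, fields separated by commas.
CSV_TEXT = "name,age,city\nAsha,31,Chennai\nBen,27,Toronto\n"


def read_csv_records(text):
    """Parse CSV text into a list of dictionaries keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def records_to_json(records):
    """Serialize the records as a JSON string, the kind of payload a web app exchanges."""
    return json.dumps(records)


records = read_csv_records(CSV_TEXT)
payload = records_to_json(records)
print(len(records), records[0]["name"])
print(payload)
```

Note that the csv module reads every field as a string, so numeric columns such as age would still need converting before analysis.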

Big Data and Data Science: Big data isn't just bigger data or lots of data. It is a qualitatively different approach to solving problems, one that requires qualitatively different methods; that is both the challenge and the promise of big data. Big data differs from small data in the following ways,
1. Goals: Small data usually serves a specific, singular goal: you are trying to accomplish one task by analyzing the data. Big data goals evolve and get redirected over time. You may have one goal at the starting point, but things can take unexpected directions.
2. Location: Small data usually lives in one place, such as one computer file or one floppy disk. Big data is spread across multiple servers in multiple locations, anywhere on the internet.
3. Data Structure and Content: Small data is typically structured in a single table, like a spreadsheet. Big data can be semi-structured or unstructured and can come from many different sources.
4. Data Preparation: Small data is usually prepared by the end-users for their own goals. They know who is putting the data in, what they want to accomplish and why the data is there. Big data is a team sport and can be prepared by many people who are not the end-users. The degree of coordination required is extraordinarily high.
5. Longevity: Small data is kept only for a limited time after the project is finished. It doesn't matter if it goes away after a few months or a few years. Big data is stored perpetually and becomes part of later projects. A future project might add new data to the existing data, historical data and data from other sources. It evolves over time.
6. Measurements: Small data is typically measured in standardized units using one protocol, because one person is doing the measuring and it all happens at one point in time. Big data comes in many different formats, measured in many different units and gathered with different protocols in different places at different times. There is no assumption of standardization or uniformity.
7. Reproducibility: Small data projects can usually be reproduced. If the data goes bad or missing, you can collect it again. With big data, replication may not be possible or feasible. Bad data is identified through a forensic process, and you either attempt to repair it or do without it.
8. Stakes Involved: With small data, the risks are generally limited. If the project doesn't work, it is usually not catastrophic. With big data, the risks are enormous because so much time and effort are invested in it. It can cost hundreds of millions of dollars, and lost or bad data can doom the project.
9. Introspection: This has to do with knowing where the data comes from and being able to identify it. Small data is well organized: individual data points are easy to locate, and clear metadata says where they come from and what the values mean. Big data consists of many different files, potentially in many different formats. It is difficult to locate the data points you are looking for. Anything that is not documented well will slip through the cracks. It is difficult to interpret the data exactly and to know what each value means.
10. Analysis: With small data, you generally analyze all the data in one procedure on one machine. Big data may need to be broken apart and analyzed in several steps using different methods, with the results combined at the end.
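
The last point, breaking data apart and combining the results at the end, can be illustrated with a small sketch. It assumes the statistic of interest is a mean and that each chunk could live on a different machine; the function names and the data are hypothetical, and here everything runs in one process only for illustration.

```python
def chunked(values, size):
    """Yield successive chunks of at most `size` items."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def partial_summary(chunk):
    """Per-chunk step: reduce one chunk to a (sum, count) pair."""
    return (sum(chunk), len(chunk))


def combine(summaries):
    """Final step: merge the partial (sum, count) pairs into one overall mean."""
    total = sum(s for s, _ in summaries)
    count = sum(n for _, n in summaries)
    return total / count


data = list(range(1, 101))  # hypothetical measurements: 1, 2, ..., 100
summaries = [partial_summary(c) for c in chunked(data, 30)]
overall_mean = combine(summaries)
print(len(summaries), overall_mean)
```

Keeping a (sum, count) pair per chunk, rather than a per-chunk mean, is what makes the combined result exact even when the chunks have different sizes.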
          Data scientists follow certain processes and stages, which together are called the data science life cycle. The first stage is to formulate a question about the problem you want to solve. The most important part of a data science project is the question itself and the creativity that goes into exploring it. The second stage is to acquire the data that is relevant to the problem or question. When you collect sample data, you need to make sure that the sample is collected with as little bias as possible. There are various methods to collect samples. Those are,
         1. Simple Random Sampling (SRS)
         2. Cluster Sampling
         3. Stratified Sampling
             Probability sampling is any sampling method that assigns a precise probability to the appearance of each sample. One method of probability sampling is simple random sampling, in which the sample is drawn at random without replacement. For example, to take a simple random sample of size two from a population of 6 people, write A to F on slips of paper and place the slips in a lot. When you take 2 slips from the lot without looking, you get one of these 15 samples with equal chance: AB, AC, AD, AE, AF, BC, BD, BE, BF, CD, CE, CF, DE, DF, EF. A pair such as FF cannot occur, because a slip is not put back after it is drawn. Another method of probability sampling is cluster sampling. The data is divided into clusters and simple random sampling is used to select whole clusters. In the previous example, you can form 3 clusters of 2 people each, AB, CD and EF, and select one of them with equal chance. Cluster sampling makes collecting the sample easier, and it is often used to conduct surveys. The disadvantage is that there is more variation in the estimate, so you need to take larger samples. Stratified sampling divides the data into strata and produces one simple random sample per stratum. In the previous example, you can divide the population into 2 strata, stratum 1 being A, B, C and stratum 2 being D, E, F. Then you use SRS to select one person from each stratum and get one of these samples: AD, AE, AF, BD, BE, BF, CD, CE, CF. In stratified sampling, the strata do not have to be the same size.
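
The three sampling schemes from the A to F example can be sketched with Python's standard library. This is only an illustration under the assumptions of the example above (a population of six, samples of size two, the clusters AB, CD, EF and the strata ABC and DEF); the random generator is seeded only so that the sketch is repeatable.

```python
import random
from itertools import combinations, product

population = ["A", "B", "C", "D", "E", "F"]

# Simple random sampling: every unordered pair is an equally likely sample.
srs_samples = ["".join(pair) for pair in combinations(population, 2)]

# Cluster sampling: the clusters are fixed, then one cluster is picked at random.
clusters = ["AB", "CD", "EF"]

# Stratified sampling: one person is drawn from each stratum.
strata = [["A", "B", "C"], ["D", "E", "F"]]
stratified_samples = ["".join(pair) for pair in product(*strata)]

rng = random.Random(0)  # seeded only so the sketch is repeatable
srs_draw = "".join(sorted(rng.sample(population, 2)))  # without replacement
cluster_draw = rng.choice(clusters)
stratified_draw = "".join(rng.choice(stratum) for stratum in strata)

print(len(srs_samples), len(clusters), len(stratified_samples))
print(srs_draw, cluster_draw, stratified_draw)
```

The first printed line shows 15, 3 and 9 possible samples for the three schemes, which matches the lists in the text.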
       In the third stage, you conduct exploratory data analysis (EDA). Its purpose is to understand the data you have. Here, you visualize the data to find patterns, issues and anything else of interest in it. Data visualization is an essential tool in data science. A rule of thumb for data science deliverables is that if there isn't a picture, you're doing it wrong. Machine learning practitioners also need visualizations to find out where the information in the data is hidden and which directions are most promising to work on. Pictures convey trends and anomalies in data more efficiently than tables of numbers. Two important visualization tools in Python are Matplotlib and Seaborn. They allow you to create two-dimensional and multi-dimensional plots of your data, and they help you visualize both qualitative and quantitative data.
When you perform EDA, make sure to apply the following,
        * Avoid making assumptions about the data
        * Examine statistical data types in the data
        * Examine the key properties of data
       This will help you find the answer to your question or the problem you want to solve. Finally, you use prediction and inference to draw conclusions from the data. Data scientists use inference to quantify how certain they are about the trends they see in their data, and to draw conclusions from the dataset. When you see a trend in the data, you need to determine whether it is due to random fluctuation in data collection or to a real phenomenon. Hypothesis testing helps you answer this question.
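
One simple way to ask whether a difference between two groups could be random fluctuation is a permutation test. The sketch below is one possible approach, not the only form of hypothesis testing; the measurements, the function name and the number of permutations are all hypothetical choices made for illustration.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Estimate a two-sided p-value: how often a random relabeling of the
    pooled data gives a mean difference at least as large as the observed one."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_permutations


# Hypothetical measurements from two groups.
control = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8]
treated = [5.0, 5.2, 4.9, 5.3, 5.1, 4.8]
p_value = permutation_test(control, treated)
print(p_value)
```

A small p-value suggests the observed difference would rarely arise from random relabeling alone, which is evidence for a real effect rather than chance.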
      Classification is a machine learning technique for making categorical predictions from data. Once you have data with the correct categories, you learn from that data to make predictions in the future. For example, weather stations forecast tomorrow's weather from today's and previous days' weather. Classification is also used to predict whether a patient has a particular disease. Each situation in which you make a prediction is called an observation. Each observation has certain aspects that are called attributes. The category that an observation belongs to is called its class. The goal of classification is to correctly predict the classes of observations using their attributes. You need to go through the whole process repeatedly as more questions and problems come up. These are the major stages of the data science process.
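
To make the terms observation, attribute and class concrete, here is a minimal nearest-neighbor classifier in plain Python. Nearest neighbors is just one classification technique chosen for illustration; the weather attributes (humidity and cloud cover, both in percent) and all the values are made up.

```python
import math
from collections import Counter


def distance(a, b):
    """Euclidean distance between two attribute tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(training, query, k=3):
    """Predict the class of `query` by majority vote among its k nearest
    training observations."""
    nearest = sorted(training, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Each observation: ((humidity %, cloud cover %), class). Values are hypothetical.
training = [
    ((85, 90), "rain"), ((80, 75), "rain"), ((90, 95), "rain"),
    ((30, 10), "dry"), ((40, 20), "dry"), ((35, 5), "dry"),
]

humid_day = knn_predict(training, (82, 80))
clear_day = knn_predict(training, (33, 12))
print(humid_day, clear_day)
```

In practice the attributes would usually be rescaled first so that no single attribute dominates the distance.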

Python3 and Raspberry Pi:  The Raspberry Pi is a single-board computer (SBC): all of its parts sit on a single printed circuit board (PCB) about the size of a credit card, and the individual components of an SBC are not replaceable or upgradeable. Ordinary desktop and laptop computers are not well suited to connecting sensors directly. A Raspberry Pi has GPIO (general-purpose input/output) pins, 40 of them on most models (the exact number depends on the model), and these pins are used to connect sensors and other devices. Using the GPIO pins, you can program hardware devices directly in high-level languages such as C, C++, shell scripts and Python, which is not practical on a PC. To get started, download the Raspberry Pi operating system. Go with Raspbian, which is based on Debian Buster, and install the image that includes the desktop environment; a minimal image is also available. The image is a large file, so if your connection is unreliable, a download manager such as Download Accelerator Plus can help it download completely. Then follow the Raspbian OS setup guide and find the IP address of your Raspberry Pi from the command line (for example with "hostname -I"). You can access the graphical desktop of the Raspberry Pi remotely by enabling Virtual Network Computing (VNC) on it. Then sign up for RealVNC and download the viewer for the operating system installed on your computer. Once you provide the username and password (the defaults are pi and raspberry), the remote desktop opens. Now you can start programming remotely. This is a very useful setup when you want to run small Python programs for training purposes, data visualization, data science and your own goals.

Saturday, 15 February 2020

What are the fundamentals of Cloud Computing and Data Science?


         Data science refers to a collection of related disciplines that focus on using data to create new information and technology. It provides useful insights for better decisions. For example, big data techniques address the challenge of analyzing the huge volume of modern data that is generated at high speed. In the real world, computing devices such as cellphones and security cameras constantly generate data, and the number of such devices connected to the internet, also known as the Internet of Things (IoT), is ever-growing. Computers can make decisions based on trusted algorithms to make accurate predictions. Data analytics is an enhanced way of taking advantage of exponentially increasing computing power and storage capacity. You need a basic knowledge of statistics to be a successful data scientist. The data industry is largely driven by languages such as Python and R, which come with powerful libraries that implement statistical functions and visualization features. The programmer or data scientist automates the routine tasks and focuses on solving larger problems. Distributed file systems such as the one in Hadoop and distributed processing engines such as Spark play a critical role in big data and enable you to make informed decisions. Machine learning helps to detect patterns in data and make better predictions about a dataset. For example, in fraud detection, machine learning dramatically reduces the workload by filtering out a large number of data points and presenting only the suspicious candidates. Visualization tools can greatly enhance the presentation. Data scientists need to specialize in the core job duties of a particular area.
          Data science requires support from cloud computing and virtualization to meet the ever-increasing size, speed and accuracy requirements of the data sets we have to manage. Cloud computing provides the scalability needed for computing resources; in practice, the cloud provides the processing power and the storage space. Software connects virtual machines through a high-speed network and implements distributed file and processing systems on top of them. Hadoop and Spark are the key elements that are built on these virtual machines. They solve data science problems by connecting to the specific data science application. Cloud computing, virtualization, machine learning and distributed computing are the technologies that let data scientists do their job effectively. Proxmox is a virtualization platform that is easy to install if you want to build your own cloud and configure the software yourself. Weka is a machine learning tool that allows users to run various machine learning algorithms in a GUI environment.

Fundamentals of Cloud Computing: If you want to familiarize yourself with Azure, first you need to familiarize yourself with cloud computing as a whole. There are 3 types of cloud deployment. Those are,
  1. Public Cloud
  2. Private Cloud
  3. Hybrid Cloud
       When we talk about infrastructure, start with what is deployed in your own company. You manage the servers, hardware, services and firewalls in your organization through an internal administrator, who is responsible for the functions offered to the users. The users consume the services, and you have to update, upgrade and manage the hardware that those services live on. In a private cloud, the user could be an administrator who has a portal-based environment from which they manage the environment, provision servers, and deploy applications, websites and everything else. What is possible depends on the software that manages the private cloud and exposes its functionality in the portal. For example, System Center 2012 R2 by Microsoft provides private cloud infrastructure. A private cloud typically runs in a private data center, so you remain responsible for the hardware, software and network services. In a public cloud such as Microsoft Azure, Google Cloud or AWS, the vendor is responsible for most of these tasks. A public cloud uses a leasing model, basically pay as you go: you pay for the infrastructure resources that your workloads, applications and services consume. The usage can be the data stored in the cloud infrastructure or the services offered by virtual machines. The advantage of public cloud infrastructure is that you can deploy a new application or server at a very low cost, and you don't need new hardware to support the additional infrastructure. Ultimately, it reduces the capital expenditure of the company. A hybrid cloud is a mix of public and private solutions: you keep your own internal private data center and some workloads there, and place other services and applications in the public cloud. It is more complex to manage because you have to manage both environments side by side.

Cloud Computing Services: The cloud is a collection of remote servers connected via computer networks and made available through the internet. Cloud computing is implemented through virtualization: a hypervisor runs on the host machine, and many guest operating systems such as Windows and Linux can be installed on top of it. You can fire up virtual machines and leverage the vast resources of the cloud provider as your business operation grows, whether gradually or exponentially. This gives you the flexibility to grow your infrastructure quickly if necessary. Cloud computing companies specialize in managing server farms and know how to maximize profit and minimize expenses. There are 3 major service models of cloud. Those are,
   1. Infrastructure as a Service (IaaS) - Think of a computer lab as the infrastructure that you are trying to use as a service. For example, if you want 10 PCs, you can use AWS EC2 to start 10 virtual machines and put them in the same network. That is now your computer lab.
   2. Platform as a Service (PaaS) - It is a platform to run your code. You go to the cloud, tell the provider which compiler or interpreter you want, and run your code on it. A cloud IDE can be used to write and run the code.
   3. Software as a Service (SaaS) - It is self-explanatory: the software itself is delivered as a cloud service. Examples are Google Docs and Dropbox, which offer free storage and run entirely in the cloud.

Application Migration to Cloud:  A successful migration of a large application portfolio requires a couple of things. Those are,

    * Think and plan strategically and,
    * Rapidly iterate through feedback loops to fix the things that are going wrong

      But there are a lot of things to consider when migrating to the cloud, including the application architecture, the ability to scale out, the distributed nature of the system and so on. When we migrate applications to the cloud, we get a new architecture with different properties and characteristics than traditional systems. One advantage of cloud migration is the ability to use an active-active architecture. It means that two or more instances of the application run at the same time and all serve live traffic. If one instance fails, another takes over its load. Here we are automating things, and the benefit of being in the cloud is worth it to the business. Application migration is necessary because
   * You are selling the migration to the stakeholders who are funding it.
   * It makes the business more agile and delivers value.
   * We come to understand the applications broadly, both the specific needs of each application and the general picture across all of them.
   * We are modernizing things: moving database models and technologies, improving security and governance, and leveraging the systems for whatever purpose we need.
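
The active-active idea described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the Instance and ActiveActiveRouter classes, the instance names and the round-robin policy are hypothetical, and a real deployment would rely on a load balancer and health checks provided by the cloud platform.

```python
class Instance:
    """A hypothetical application instance that can be marked healthy or failed."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class ActiveActiveRouter:
    """Send requests round-robin to all instances and skip any that fail."""

    def __init__(self, instances):
        self.instances = list(instances)
        self._next = 0

    def route(self, request):
        for _ in range(len(self.instances)):
            instance = self.instances[self._next]
            self._next = (self._next + 1) % len(self.instances)
            try:
                return instance.handle(request)
            except ConnectionError:
                continue  # fail over to the next active instance
        raise RuntimeError("no healthy instance available")


east = Instance("east")
west = Instance("west")
router = ActiveActiveRouter([east, west])
first = router.route("req-1")   # both instances are serving traffic
second = router.route("req-2")
east.healthy = False            # simulate a failure of one instance
third = router.route("req-3")   # the surviving instance takes over
print(first, second, third, sep="\n")
```

Both instances serve requests while healthy, which is what distinguishes active-active from a setup where a standby sits idle until a failure.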

The important steps in Cloud Migrations are,
Ultimately, there is a bit of trial and error, so establish operational processes and aim for continuous improvement.

Data Migration to Cloud: Data is the highest priority when migrating to the cloud. Basically, data is the business, and it is everywhere in the enterprise. Data is the killer application of cloud computing. We migrate to the cloud and find more value in new and innovative ways by running databases, big data systems, predictive analytics and AI-based systems there. Data selection is therefore a critical process: to succeed, you need to understand which database is bound to which applications, what those applications are doing, and the security, compliance and performance issues involved. The business case, migration, testing and deployment all rest on an understanding of the data. You need to look at the applications that depend on the data before you do the deployment. Ultimately, the goals of leveraging data are to lower operational costs, to integrate existing data silos so that different databases communicate with one another as a single dataset, and to influence actions and outcomes, so that people have not just data but the information they need to run the business better. We are not going to move every piece of data that exists on-premises into the cloud. We may move 70% of it, and then we have to deal with integration between the on-premises data stores and those that exist in the cloud. So make sure to build a solid architectural foundation for the data, and avoid duplicate data and data silos. In a real-world scenario, you need to consider the following things when you migrate to the cloud,
 * It is necessary to understand the total cost of ownership (TCO) for the first year, the second year, five years and so on. It includes the TCO for applications, for databases and for cloud instances, as well as the return on investment (ROI). The top 5 TCO/ROI factors are,
   - Value of agility
   - Cost to retire selected applications, infrastructure or data centers
   - Changes required to maintain a service level
   - Software costs
   - Organizational transformation costs
 * Ensure that a solid business case exists before the migration begins, including how the technology is going to be applied
 * Determine the value metrics or value points, such as agility, compressed time to market and cost savings, against which the total cost will be measured
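
A multi-year TCO and ROI comparison comes down to simple arithmetic, sketched below. All figures are hypothetical (in thousands of dollars), the function names are made up for this example, and ROI is taken here as (benefit - cost) / cost, which is one common definition rather than the only one.

```python
def cumulative_tco(upfront_cost, annual_costs):
    """Running total cost of ownership at the end of each year."""
    totals = []
    running = upfront_cost
    for cost in annual_costs:
        running += cost
        totals.append(running)
    return totals


def roi(total_benefit, total_cost):
    """Return on investment as a fraction: (benefit - cost) / cost."""
    return (total_benefit - total_cost) / total_cost


# Hypothetical figures, in thousands of dollars.
migration_cost = 200                         # one-time cost of the migration
yearly_run_costs = [120, 100, 100, 90, 90]   # cloud, software, staff per year
yearly_benefits = [100, 180, 200, 220, 240]  # value of agility, savings, etc.

tco_by_year = cumulative_tco(migration_cost, yearly_run_costs)
five_year_roi = roi(sum(yearly_benefits), tco_by_year[-1])
print(tco_by_year)
print(round(five_year_roi, 3))
```

Looking at the running totals year by year, rather than only the five-year figure, shows when the migration is expected to start paying for itself.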