
5 Tools Every Data Scientist Needs to Master


From plotting to version control…

Photo by Susan Holt Simpson on Unsplash

Data Science is a field that engages with a wide variety of domains, from biology and finance to geography and retail. This means that Data Science projects can take a variety of formats and present wildly different challenges. That does not mean, however, that there are no commonalities between them. In my previous post, I described five stages that every Data Science project will go through. Here I will cover five tools that every Data Scientist should master to be able to work through those stages as smoothly as possible.

These tools are:

  • A plotting library: To produce quick, clean, and professional visualisations.
  • A mathematical library: For those calculations just beyond the base capabilities of the language you code in.
  • A library to hold your data: To quickly and simply load, interact with, and feed your data into your models.
  • SQL: For working with data that exceeds the capacity of your computer's memory.
  • Version control: To effectively manage your files and work with others.

Mastering these will help you complete Data Science projects to the highest standard!

A plotting library

Every Data Scientist's toolkit should include mastery of at least one plotting library, regardless of the language chosen. Mastering a plotting library gives you the ability to create effective visualisations to showcase your data, method, and results. If a picture is worth a thousand words, that is especially true in the field of Data Science. Clear visualisations allow you to communicate effectively with a variety of different stakeholders, increasing the value and reach of your projects.

Mastery of the chosen plotting library should include knowing how to create simple plots quickly. This will benefit you when you are stuck: when you are trying to understand the data, do not know how to progress, or want to confirm that you are heading in the right direction. Nine times out of ten, these challenges can be resolved simply by looking at a quick and dirty plot. This does not mean you have to include labels, colour the plot properly, or even add a title; this plot will only be seen by you, so as long as you understand it you will be fine.
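As a minimal sketch of what such a quick and dirty plot might look like in Matplotlib (the data here is invented purely for illustration):

    import matplotlib.pyplot as plt

    # Throwaway data, invented just to eyeball a relationship
    x = [1, 2, 3, 4, 5]
    y = [2.1, 4.3, 5.9, 8.2, 9.8]

    # No labels, title, or styling: the only audience is you
    plt.plot(x, y)
    plt.show()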

Being able to create these simple plots will also lay the foundation for more complex visualisations that can be added to publications and presentations, or shown to leadership. These plots will often take much longer to produce and require a much deeper knowledge of the chosen library, but they will ultimately allow you to communicate your process or results to a much wider audience. To do this you will need to know how to add legends, labels, and annotations, manipulate colours, and combine multiple plots on the same chart, among many other things. These plots can support the story you are telling your audience in a way that no words could convey in such a short space, and will add value to your final result.
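A presentation-ready version of the earlier plot might look something like the sketch below; the series, labels, and file name are all invented for illustration:

    import matplotlib.pyplot as plt
    import numpy as np

    x = np.linspace(0, 10, 50)

    fig, ax = plt.subplots(figsize=(8, 5))

    # Multiple series on the same axes, each with a legend entry
    ax.plot(x, np.sin(x), color="tab:blue", label="sin(x)")
    ax.plot(x, np.cos(x), color="tab:orange", linestyle="--", label="cos(x)")

    # Labels, a title, and an annotation to guide the audience
    ax.set_xlabel("x")
    ax.set_ylabel("value")
    ax.set_title("A presentation-ready plot")
    ax.annotate("curves cross here", xy=(np.pi / 4, np.sin(np.pi / 4)),
                xytext=(4.5, 0.9), arrowprops={"arrowstyle": "->"})
    ax.legend()

    fig.savefig("example_plot.png", dpi=300)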

I suggest starting out by learning a single library, because syntax and implementation can vary between them and one library will often cover the majority of use cases. Then, once you mostly understand the first library and its limitations, you can seek out others to add to your repertoire and fill the gaps the current library cannot. In Python, the library most people start off learning is Matplotlib. It is easy to learn and extremely versatile once you understand some of its more complex functionality, leaving very few gaps that need to be filled by other libraries. Beyond that there are also plotly, seaborn, ggplot, and others that can be used on top of, alongside, or instead of Matplotlib, each with different advantages and disadvantages.

An introduction to plotting with Matplotlib in Python

A mathematical library

A core mathematical and/or statistical library should also appear in most Data Scientists' toolkits. While most languages have inbuilt mathematical operations, many processes in a Data Science workflow are made easier by a library built on top of that existing functionality. The benefits include easy access to mathematical constants and notation, the ability to perform advanced calculations and operations quickly and efficiently, and efficient storage of large amounts of data.

Learning a mathematical or statistical library is especially useful for the more complex mathematical methods that you may want to integrate into your Data Science workflow, such as operations over large amounts of data or certain types of statistical analysis. It is also useful because, in the majority of languages, the more advanced libraries build on top of existing ones. An example of this is scikit-learn, arguably the go-to machine learning library in Python, which takes advantage of much of Numpy's inbuilt functionality to perform its calculations and machine learning algorithms. By knowing the functionality of one of the base libraries, you are best placed to take advantage of the more advanced ones and to know what is going on under the hood of your models.

In Python, the go-to library for this is Numpy, thanks to its rich depth of functionality and ease of use. Numpy arrays store information more efficiently than Python lists while also allowing for quicker and more advanced data manipulation, including the ability to perform calculations efficiently at a much greater scale than you otherwise could. If you want to get started with this library, feel free to see the introduction below:

UCL Data Science Society: Introduction to Numpy
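As a small taste of that vectorised functionality, here is a quick sketch comparing a pure-Python loop with the equivalent Numpy operation (the data is randomly generated purely for illustration):

    import numpy as np

    # One million made-up measurements
    rng = np.random.default_rng(seed=42)
    values = rng.normal(loc=10.0, scale=2.0, size=1_000_000)

    # Pure Python: loop over a list element by element
    squared_list = [v ** 2 for v in values.tolist()]

    # Numpy: one vectorised operation over the whole array,
    # typically orders of magnitude faster
    squared_array = values ** 2

    # Aggregations and constants come built in
    print(values.mean(), values.std(), np.pi)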

A Library to Hold Your Data

The third tool that must be in every Data Scientist's toolkit is a library that can be used to store, hold, and manipulate your data. While many languages have inbuilt data structures, their functionality is often limited when it comes to Data Science workflows. An effective library should be able to read in data from a variety of formats, allow you to perform basic calculations, feed into common plotting libraries (or produce plots itself), and integrate with a variety of machine learning models.

For this tool to be used effectively, there are three actions you must be able to perform. The first is loading in data from a variety of sources and formats. Data Science projects often involve data from many different places, so a library that can handle all of these is incredibly useful. Mastery of this will let you load in data and start working much more quickly than juggling a variety of libraries or tools, which would otherwise increase the complexity of any Data Science workflow. While one library may not be able to load from every data source, it should handle at least the most common ones, such as CSVs, Excel files, text files, JSON, and databases.
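With Pandas in Python, for example, each of those common formats is a one-liner; the file and table names below are hypothetical:

    import sqlite3

    import pandas as pd

    # Hypothetical file names, purely for illustration
    df_csv = pd.read_csv("sales.csv")
    df_excel = pd.read_excel("sales.xlsx")  # needs openpyxl installed
    df_json = pd.read_json("sales.json")

    # Reading straight from a database connection
    conn = sqlite3.connect("sales.db")
    df_db = pd.read_sql("SELECT * FROM sales", conn)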

A Data Scientist should also be able to perform simple data manipulations with the chosen library. This includes performing simple calculations between columns or rows, selecting items based on certain conditions, grouping data together to get summary statistics, and creating subsets of the data. This allows any Data Scientist to get a feel for and an overview of the data before any visualisation or modelling has been performed, helping to narrow down the next steps. In the early stages of your project you want any data manipulation to be quick and effective, which a good library makes a lot easier.
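A sketch of what those manipulations look like in Pandas, using a small invented DataFrame:

    import pandas as pd

    # Invented data, purely for illustration
    df = pd.DataFrame({
        "region": ["North", "South", "North", "South"],
        "price": [10.0, 12.5, 9.0, 11.0],
        "quantity": [3, 1, 4, 2],
    })

    # A calculation between columns
    df["revenue"] = df["price"] * df["quantity"]

    # Selecting rows based on a condition
    north = df[df["region"] == "North"]

    # Grouping to get summary statistics
    summary = df.groupby("region")["revenue"].agg(["sum", "mean"])

    # A subset of the columns
    subset = df[["region", "revenue"]]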

The final action you must be able to perform is passing the data structure into another library for visualisation or modelling. While this library will have some inbuilt functionality for visualisation and calculations, you will often lean on other libraries for more in-depth visualisations or for machine learning algorithms. Being able to create the exact data structure you want to pass into these other libraries is vital for an effective Data Science workflow. This can involve many of the simple manipulations above, but you need to ensure the data can be passed in the right format for the chosen tool.
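A minimal sketch of that hand-off, here from a Pandas DataFrame into a scikit-learn model (with invented columns again):

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Invented data, purely for illustration
    df = pd.DataFrame({
        "feature_a": [1.0, 2.0, 3.0, 4.0],
        "feature_b": [0.5, 0.1, 0.9, 0.3],
        "target": [2.0, 3.9, 6.2, 7.8],
    })

    # Pandas objects slot straight into most scikit-learn estimators
    X = df[["feature_a", "feature_b"]]
    y = df["target"]

    model = LinearRegression().fit(X, y)
    predictions = model.predict(X)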

Which library you choose will often depend heavily on the language you are using, because each language has its own implementations of the different data structures. This can make porting from one language to another difficult, as you have to get used to the nuances of each new library. In Python, the most common library used to store data efficiently is Pandas. Pandas DataFrames are an efficient way to store, manipulate, and extract data, and they integrate with the majority of machine learning libraries in the Python ecosystem, making Pandas effectively the Swiss Army knife of Data Science.

UCL Data Science Society: Pandas

SQL

Beyond the tools mentioned above, which will often depend on the language you choose for your Data Science workflow, most Data Scientists should at least have a basic understanding of Structured Query Language (SQL). This will allow you to interact with data from a variety of sources in such a way that you can easily extract the data you are interested in. It is especially beneficial for Data Science projects that interact with datasets too large to store or manipulate efficiently in common data formats or structures, and for storing data when it is not being used.

Data Scientists should at least learn the basics of SQL so they can interact with structured databases. These skills include being able to select and filter data, just as with the library that holds your data, so that you extract only the information you are interested in. In some cases this filtering has to be done at the database level, rather than in the workflow, because otherwise you may not have enough computational power to handle the amount of data returned. Filtering at an early stage also means you only load in the data you are interested in, reducing the computational resources required by the workflow. This can save you both time and money in the long run.
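As a sketch of what that looks like in practice, here is a filter pushed down to the database, run from Python through the built-in sqlite3 module against a hypothetical sales table:

    import sqlite3

    import pandas as pd

    conn = sqlite3.connect("sales.db")  # hypothetical database

    # Only rows matching the conditions ever leave the database,
    # so the workflow never holds the full table in memory
    query = """
        SELECT order_id, region, revenue
        FROM sales
        WHERE region = 'North'
          AND revenue > 100
    """
    df = pd.read_sql(query, conn)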

In addition to this, you should also be able to perform simple groupings and joins. Groupings allow you to extract summary statistics easily, giving you an overall view of the data before you create any workflows and grounding you in the data you are working with. Joins allow you to connect data from different tables or views within the database, so that you do not have to create repeated instances of the same data but can instead combine it to get what you want.
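Sketches of both, again assuming hypothetical sales and customers tables:

    import sqlite3

    import pandas as pd

    conn = sqlite3.connect("sales.db")  # hypothetical database

    # Summary statistics computed inside the database itself
    summary = pd.read_sql("""
        SELECT region, COUNT(*) AS n_orders, SUM(revenue) AS total_revenue
        FROM sales
        GROUP BY region
    """, conn)

    # Joining tables so the same data is never duplicated across them
    orders = pd.read_sql("""
        SELECT s.order_id, s.revenue, c.customer_name
        FROM sales AS s
        JOIN customers AS c ON s.customer_id = c.customer_id
    """, conn)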

An introduction to SQL for Data Scientists

Version Control

The final tool that all Data Scientists should have in their toolkit is version control, which in practice usually means git. This is very important for long workflows and collaborative projects, which can get very messy without it. Even if you work alone and most of your projects are small, you should still use some form of version control. If nothing else, it will cut down on the v1, v2, v3, final, final_v2 naming conventions commonly seen in many work folders.

Using version control should at least cover effectively managing a local repository. This includes being able to store current work, make new commits to the repository, and navigate through previous commits to undo or remake changes. Such understanding allows Data Scientists to manage any project so that changes or errors introduced into the workflow can be removed or reversed if need be, and so that multiple instances of the same file are not needed. More advanced knowledge would then include how to create branches and merge or rebase them back into a mainline branch, so as to isolate feature development and keep a working product stable.

Data Scientists should also know how to integrate their local repository with a remote one. This not only ensures there is a backup of all the work you have done (very important!) but also enables you to collaborate effectively with others. It includes understanding how to push to and pull from a remote repository, create and merge feature branches, and deal with merge conflicts. These skills enable easy collaboration within a team working on the same project, while reducing merge conflicts and issues with others changing the same code you are working on. This is essential in modern Data Science projects, where a team dispersed across geographies and time zones can be working on a single project.

Learning git is likely to be a lifelong journey, but understanding the basics will benefit any Data Scientist. This does not mean you have to be a git power user who works exclusively from the command line (there are many useful GUIs out there!), but any Data Scientist must at least know the basic functionality. A good place to start is the introduction below:

Git and GitHub basics for Data Scientists

Conclusions

Data Science can be a tricky field to navigate, but there are five common tools that will make any Data Scientist's life easier once mastered: a library for effective plotting, a library for mathematical calculations, a library to store the data in your workflows, SQL, and version control. While these will in no way make every Data Science project a breeze, they will definitely improve both the final product and the process. Good luck!

If you liked what you read and are not yet a Medium member, feel free to sign up to Medium using my referral link below to support me and other amazing writers on this platform! Thank you in advance.

Join Medium with my referral link - Philip Wilkinson
