Data science for personalized trend detection


During my work as a data scientist I closely follow trends in the field of artificial intelligence. This post describes one task I regularly perform to analyze the value propositions of different companies. Over the last few years I have collected data and value propositions from more than 2,000 data science companies.

When playing with this data pool I mostly do fast prototyping, working with new services on the market to see whether they meet my expectations. Among other things, I have tested machine learning services from different providers (mainly SAP and Azure) for topic detection; a rough sketch of such a service call follows below. As a data scientist I always try to combine a purpose (business understanding) with IT and machine learning know-how.
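
I will not reproduce the exact provider calls here, but as a rough, hedged sketch: key-phrase extraction via the Azure Text Analytics REST API is one way to approximate topic detection. The resource name, API version, and key below are placeholders and assumptions, not the configuration I actually used.

  import requests

  # Assumed Azure Text Analytics resource; replace with your own endpoint and key.
  ENDPOINT = "https://my-text-analytics.cognitiveservices.azure.com"
  API_KEY = "YOUR_API_KEY"

  def extract_key_phrases(text, language="en"):
      """Call the key-phrase endpoint as a rough proxy for topic detection."""
      url = f"{ENDPOINT}/text/analytics/v3.1/keyPhrases"
      body = {"documents": [{"id": "1", "language": language, "text": text}]}
      headers = {"Ocp-Apim-Subscription-Key": API_KEY}
      response = requests.post(url, json=body, headers=headers, timeout=30)
      response.raise_for_status()
      return response.json()["documents"][0]["keyPhrases"]

  print(extract_key_phrases("We build supply chain optimization tools with machine learning."))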

Since data science is more than machine learning, I would like to point out the different techniques I used in the background for rapid prototyping. Rapid prototyping, or hacking, is just like playing with LEGO: you have to know the right libraries and services. Please remember the rule of thumb:

going from prototype to production requires roughly 100x more effort

Data science work should always follow an iterative process: value -> model -> evaluation -> usage. Otherwise you lose track of the time you spend, because it is easy to fall in love with the really fun part: machine learning.

One overall summary of this snapshot: 80% of the time is spent on data crawling and cleaning. I changed my own business objective for this exercise three times :-)

  • Business Objective (personal): always staying up to date with new data science trends, checking value propositions, and performing an automated 'digital maturity' test of companies
  • Data Crawling / Cleaning: I realized to my surprise that Twitter and, of course, Hacker News can be a good starting point for picking up the latest trends. Most of the time I use available Python packages for web crawling, or a local Tika server (launched via Docker) to extract metadata and text from different file types (such as PPT, XLS, and PDF); see the extraction sketch after this list.
  • Model Building / Value Enrichment: the Google Geocoding API converts raw addresses to lat/lng coordinates, and machine learning services handle topic detection and translation. One natural language processing technique I like a lot is automated text summarization. Note that all of these techniques use machine learning to extract or enrich value information; sketches of the geocoding and summarization calls follow after this list.
  • Data Visualization: I often use Leaflet maps (see figure, showing only 100 locations); they give me a dynamic map of company locations, their value messages, and a direct link for more information. In the background I implemented search options to filter companies by specific topics (e.g. Supply Chain + Data Science); a filtering and popup sketch appears after the Python snippets below.
  • Usage: most of the time I explain these techniques in my data science lecture at the Technical University of Kaiserslautern. Additionally, over the last year I have increasingly come to see the real value and great possibilities of the already cleaned and analyzed data. For any questions or comments, just reply to this post or send me a message.
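
For the extraction step, a minimal sketch along these lines works against a local Tika server started via Docker (for example with the official apache/tika image on its default port 9998); the file name is a made-up placeholder.

  import requests

  # Assumed local Tika server endpoint (default port of the apache/tika Docker image).
  TIKA_URL = "http://localhost:9998"

  def extract_text(path):
      """Send a binary file (PPT, XLS, PDF, ...) to the Tika server and get plain text back."""
      with open(path, "rb") as f:
          response = requests.put(f"{TIKA_URL}/tika", data=f,
                                  headers={"Accept": "text/plain"}, timeout=60)
      response.raise_for_status()
      return response.text

  def extract_metadata(path):
      """Return the document metadata (content type, author, ...) as JSON."""
      with open(path, "rb") as f:
          response = requests.put(f"{TIKA_URL}/meta", data=f,
                                  headers={"Accept": "application/json"}, timeout=60)
      response.raise_for_status()
      return response.json()

  # Hypothetical file name for illustration only.
  print(extract_text("company_pitch.pdf")[:500])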
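
For value enrichment, the address-to-coordinates step can be sketched as follows with the Google Geocoding API; the API key is a placeholder, so the call only succeeds once you supply a valid key.

  import requests

  GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"
  API_KEY = "YOUR_GOOGLE_API_KEY"  # placeholder, replace with a real key

  def geocode(address):
      """Convert a raw address string to (lat, lng) via the Google Geocoding API."""
      params = {"address": address, "key": API_KEY}
      response = requests.get(GEOCODE_URL, params=params, timeout=30)
      response.raise_for_status()
      results = response.json()["results"]
      if not results:
          return None
      location = results[0]["geometry"]["location"]
      return location["lat"], location["lng"]

  print(geocode("1600 Amphitheatre Parkway, Mountain View, CA"))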
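
The summarization I mention can be sketched with gensim's extractive summarizer, assuming gensim 3.x (the gensim.summarization module was removed in gensim 4.0); the text about a fictional company is only a placeholder for crawled website content.

  # Requires gensim < 4.0, where the summarization module was still included.
  from gensim.summarization import summarize

  # Placeholder text standing in for crawled company content (fictional company).
  crawled_text = (
      "Acme Analytics builds machine learning tools for supply chain planning. "
      "The company focuses on demand forecasting and inventory optimization. "
      "Its platform integrates with common ERP systems. "
      "Customers include retailers and manufacturers across Europe. "
      "The team combines data engineering with operations research. "
      "Recent projects cover anomaly detection in logistics data. "
      "The company also offers workshops on data-driven decision making. "
      "A cloud API exposes the forecasting models to external developers. "
      "Pricing is based on the number of forecasted time series. "
      "The roadmap mentions reinforcement learning for replenishment policies."
  )

  # Keep roughly 30% of the sentences as an extractive summary.
  print(summarize(crawled_text, ratio=0.3))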

Leaflet example

  • Zoom out to see all companies
  • Click on a location to see the text extracted via web crawling (done in 2018)
  • Only 200 companies are included

Some Python snippets

Popup via the folium wrapper: manipulate the data in Python and visualize it as an interactive Leaflet map.

  import folium

  # Create a map and add a marker with an HTML popup at the same location.
  m = folium.Map(location=[45.5236, -122.6750])
  folium.Marker([45.5236, -122.6750], popup='<i>Mt. Hood Meadows</i>').add_to(m)

  # Write the interactive Leaflet map to a standalone HTML file.
  m.save('map.html')
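
To get closer to the map described above, the filtering and popup logic can be sketched roughly like this; the company records are made-up placeholders, not entries from my actual data set.

  import folium

  # Made-up placeholder records; the real data set holds the crawled value messages.
  companies = [
      {"name": "Example AI GmbH", "lat": 49.44, "lng": 7.75,
       "topics": ["Supply Chain", "Data Science"],
       "value_message": "Demand forecasting for manufacturers.",
       "url": "https://example.com"},
      {"name": "Sample Analytics", "lat": 52.52, "lng": 13.40,
       "topics": ["Computer Vision"],
       "value_message": "Visual quality inspection.",
       "url": "https://example.org"},
  ]

  def filter_by_topics(records, required_topics):
      """Keep only companies whose topic list contains all required topics."""
      return [r for r in records if set(required_topics) <= set(r["topics"])]

  m = folium.Map(location=[50.0, 9.0], zoom_start=6)
  for company in filter_by_topics(companies, ["Supply Chain", "Data Science"]):
      popup_html = (
          f"<b>{company['name']}</b><br>{company['value_message']}<br>"
          f"<a href='{company['url']}'>More information</a>"
      )
      folium.Marker([company["lat"], company["lng"]],
                    popup=folium.Popup(popup_html, max_width=300)).add_to(m)

  m.save("companies_map.html")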