Data Crawling and Clustering: The Complaints Center of Surabaya Residents

Adam WB
Data Science Indonesia
5 min read · Jan 21, 2020

--

https://m.solopos.com/surabaya-kota-terpopuler-dunia-2018-957650

Actually I finished this project 9 months ago, but I was too lazy to write it up on Medium, haha. Now I am trying to make it more readable and easier to learn from for everyone. Here we go:

Surabaya has a portal that collects all the complaints from Surabaya residents about the city's services (especially government services). Kindly check https://mediacenter.surabaya.go.id to see more detail.

What do we do now? Okay, first, I was really curious about that data. You can see on the website that all the complaints are text based. So, can the data be analyzed? Sure, that's possible! But first we need to pull the data out and save it into a file or something that makes it easy for us to read. So here I am using Scrapy in Python to do the crawling job. Here's my code:

```python
import scrapy

class MediacenterTabletsSpider(scrapy.Spider):
    name = 'mediacenter_tablets'
    allowed_domains = ['mediacenter.surabaya.go.id']
    start_urls = ['http://mediacenter.surabaya.go.id/']
    # Write every scraped item into a timestamped CSV feed
    custom_settings = {
        'FEED_URI': 'mediacenter_%(time)s.csv',
        'FEED_FORMAT': 'csv',
    }

    def parse(self, response):
        print('processing: ' + response.url)
        # Extract the complaint texts using XPath
        complaints = response.xpath('//div[@onclick]/text()').extract()
        for text in complaints:
            # Yield the scraped info to Scrapy as a dictionary
            yield {
                'page': response.url,
                'product_name': text,
            }
        # Follow the pagination link right after the active page item
        NEXT_PAGE_SELECTOR = ('//li[contains(@class,"page-item active")]'
                              '/following-sibling::li[1]/a/@href')
        next_page = response.xpath(NEXT_PAGE_SELECTOR).extract()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page[0]),
                                 callback=self.parse)
```

At the time, I was using the Spyder IDE on Anaconda. There are plenty of tutorials on the internet to help you through Scrapy's features. It took me two days to learn how to get the data with Scrapy (especially for paginated websites).

So what's next? After the crawling process, I saved the data into a CSV file, "mediacenter_%(time)s.csv".

It's time to open a Jupyter notebook, hahaha.

After having the 'dataset', we are ready to analyze it. My first thought was: what if this data were clustered? Then we would know what types of complaints are raised by the community. So yeah, I did clustering.


First I checked the data by reading it with pandas. The data did not look good enough (some rows contained only junk such as '\n', e.g. at index 2, 6, and so on), so I did a cleansing step to drop the rows that have no meaning.
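Since the original notebook screenshot isn't reproduced here, a minimal pandas sketch of this step might look like the following (the sample rows are made up for illustration; `product_name` is the column written by the spider):

```python
import pandas as pd

# Hypothetical stand-in for mediacenter_<time>.csv
df = pd.DataFrame({'product_name': [
    '\n',
    'Lampu jalan di Jalan Darmo mati sejak seminggu lalu',
    '\n',
]})

# Strip surrounding whitespace, then drop rows that are empty or just '\n'
df['product_name'] = df['product_name'].str.strip()
df = df[df['product_name'] != ''].reset_index(drop=True)
print(df)
```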


So I filtered the data, keeping only rows whose text length is above 21 characters (this is just my assumption; you can use another approach for this).
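That length filter is a one-liner in pandas; here is a sketch (21 is the threshold from the text, the sample data and column name are assumptions):

```python
import pandas as pd

# Hypothetical sample; 'product_name' is the column from the crawl
df = pd.DataFrame({'product_name': [
    'tes',
    'Mohon perbaikan lampu jalan yang mati di kawasan Rungkut',
]})

# Keep only complaints longer than 21 characters
df = df[df['product_name'].str.len() > 21].reset_index(drop=True)
```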

As in any NLP task, part of the preprocessing is deleting characters that have no meaning, so I applied regular expressions to the dataset.
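The original regex cell isn't shown here, but a sketch of that kind of cleaning (lowercasing, stripping links, punctuation, and extra whitespace — the exact patterns are my assumption) could be:

```python
import re

def clean_text(text):
    """Lowercase, strip URLs and non-letters, collapse whitespace."""
    text = text.lower()
    text = re.sub(r'http\S+', ' ', text)      # remove links
    text = re.sub(r'[^a-z\s]', ' ', text)     # keep letters only
    text = re.sub(r'\s+', ' ', text).strip()  # collapse whitespace
    return text

print(clean_text('Lampu PJU mati di Jl. Darmo!! cek http://example.com'))
# → 'lampu pju mati di jl darmo cek'
```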


After I got a clean dataset (I hope so), I did some preprocessing steps. First I ran a stemmer on the data; stemming is a process that converts each word into its base form. I used Sastrawi as the library. Then I did one final preprocessing step, which was stopword removal.

After that, I used TF-IDF for feature extraction.
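The original cell isn't reproduced here; with scikit-learn a minimal version (toy documents of my own) would be:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    'lampu jalan mati di rungkut',
    'jalan rusak berlubang di rungkut',
    'layanan izin lambat',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix of shape (n_docs, n_terms)
print(X.shape)
```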


After I got the features, I used KMeans for modelling. I used the silhouette score and the distortion to evaluate the model and choose the number of clusters.
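The evaluation loop can be sketched like this (here `X` is synthetic stand-in data instead of the real TF-IDF matrix, and the k range is my assumption; in scikit-learn, `inertia_` is the distortion):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Stand-in data; in the notebook, X is the TF-IDF matrix
X, _ = make_blobs(n_samples=200, centers=4, random_state=42)

scores = {}
for k in range(2, 8):
    km = KMeans(n_clusters=k, random_state=42, n_init=10).fit(X)
    # inertia_ = sum of squared distances to the nearest centroid (distortion)
    scores[k] = (km.inertia_, silhouette_score(X, km.labels_))
    print(k, scores[k])

# Pick the k with the best silhouette score
best_k = max(scores, key=lambda k: scores[k][1])
```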


Then I chose 26 clusters based on the evaluation above.


So is that all? No, we still need to show what kind of topics KMeans created.


So here we are. There are 26 topics, based on the number of clusters. Some topics have no relevance to the complaints, so we can choose the topics that have a clear meaning and represent the complaints well. What's next? I think this is a good question. Much data science work just ends like this (just a notebook). I hope I can use it to help the government support their systems, so they don't need to read all the texts one by one and route each to a department unit. But the journey is still far away. Like Avicii's song lyrics: I can't tell where the journey will end, but I know where to start.

Maybe this is the start of being able to see the conditions around us and help as much as we can. There are still many mistakes, and there is still time for me to keep learning from self-projects like this. If there is anything to discuss, I will be happy to discuss it. So, which city will be analyzed next?
