Report on OpenStreetMap Data Cleaning Project

Contents

  • Map area
  • Data Overview
    • File size
    • Tag counts
  • Data cleaning process
    • Main problems in the data
    • Data cleaning functions
    • Results after updating
  • Data exploration from the SQL database
    • Size of files
    • Top 10 contributing users of the Beijing map
    • Number of hutongs in Beijing
    • Source distribution
  • Problems encountered and solutions
  • Other ideas about the data set
  • Conclusions

Map area

Beijing, China

Beijing is the capital of China, with many ancient buildings alongside well-designed modern architecture. I would like to explore some information about our great capital.
The link to the data file is: link

Data Overview

  • File size
    The size of the file is **164 megabytes**, which meets the size requirement of the Udacity course.
In [60]:
import os
print ('beijing_china.osm:{} MB'.format(os.path.getsize('beijing_china.osm') >> 20))
beijing_china.osm:164 MB
  • Tag counts
    Before cleaning the data, I wanted an overview of what I have. The first thing to check is how many of each tag type the file contains, which is counted by the function count_tags() below.
    Outputs:
    {'osm': 1, 'bounds': 1, 'node': 777694, 'tag': 339853, 'way': 115041, 'nd': 928128, 'relation': 5601, 'member': 60922}
In [ ]:
import xml.etree.ElementTree as ET

def count_tags(filename):
    # Count how many times each XML tag (node, way, tag, nd, ...) appears in the OSM file.
    tags = {}
    for _, elem in ET.iterparse(filename, events=('start',)):
        tags[elem.tag] = tags.get(elem.tag, 0) + 1
    return tags
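
Since the extract is around 164 MB, it can also help to clear each element once it has been counted, so that iterparse does not keep the whole tree in memory. A minimal sketch of that variant (same counts, lower memory use):

In [ ]:
import xml.etree.ElementTree as ET

def count_tags_lowmem(filename):
    # Same tag counting as above, but each element is cleared after it is
    # counted, so the parser does not accumulate the whole document in memory.
    tags = {}
    for _, elem in ET.iterparse(filename, events=('end',)):
        tags[elem.tag] = tags.get(elem.tag, 0) + 1
        elem.clear()
    return tags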

Data cleaning process

1. Main problems in the data

  • Way and node names: there are many way names; most are in Chinese, but some are in English or pinyin, such as **Rongxian Hutong** and **Embassy of Germany**. Those names need to be converted to Chinese characters for further exploration.
  • Phone format: some phone numbers are written as '01087671788', while others are written as '+86 010 69618888'.
  • Node source: sample node source values look like {'Bing', 'GPX', 'bing', 'gps', 'Bing, 2005-04'}.
  • Cuisine: sample cuisine values look like {'american;burger', 'chinese', 'german'}, which will be reduced to a single cuisine name.

2. Data cleaning functions

The functions used in this part are shown below. Names of nodes and ways written in pinyin are partially replaced with Chinese characters; for example, 'Rongxian Hutong' becomes 'Rongxian 胡同'. Fully translating pinyin into Chinese is tricky, so I left the rest alone; in practice I care more about the type of a node or way (such as '胡同' or '公路') than about its specific name.

In [1]:
# Functions for updating names, cuisine, phone and source values of ways and nodes
import re

mapping_dict = {
    'Road': '路',
    'Expressway': '高速公路',
    'Lu': '路',
    'lu': '路',
    'Hutong': '胡同',
    'hutong': '胡同',
    'Embassy': '大使馆',
    'Coffee': '咖啡厅'
}

def mapping(name, mapping_dict):
    # Replace the last English/pinyin word of a name with its Chinese equivalent,
    # e.g. 'Rongxian Hutong' -> 'Rongxian 胡同'; otherwise return the name unchanged.
    name_list = name.split()
    if name_list[-1] in mapping_dict:
        name_list[-1] = mapping_dict[name_list[-1]]
        return ' '.join(name_list)
    return name

def update_cuisine(cuisine):
    # Keep only the first cuisine when several are listed, e.g. 'american;burger' -> 'american'.
    return cuisine.split(';')[0]

def update_phone(phone):
    # Normalize phone numbers to a '+86...' form where possible.
    if len(phone) == 8:            # local Beijing number: add country and area code
        return '+86010' + phone
    elif len(phone) == 11:         # mobile number: add country code
        return '+86' + phone
    elif len(phone) > 11:          # already has a prefix: just strip spaces
        return phone.replace(' ', '')
    else:                          # too short or an odd length: flag as an error
        return 'Error' + '+' + phone

def update_source(source):
    # Keep only the first token of the source string and normalize its spelling,
    # e.g. 'bing, 2005-04' -> 'Bing'; anything unrecognized becomes 'other'.
    sourcename_0 = re.split(r'[,;. ]', source)[0]
    reg_bing = re.compile(re.escape('bing'), re.IGNORECASE)
    reg_gps = re.compile(re.escape('gps'), re.IGNORECASE)
    reg_yahoo = re.compile(re.escape('yahoo'), re.IGNORECASE)
    reg_survey = re.compile(re.escape('survey'), re.IGNORECASE)
    sourcename_0 = reg_bing.sub('Bing', sourcename_0)
    sourcename_0 = reg_gps.sub('GPS', sourcename_0)
    sourcename_0 = reg_yahoo.sub('Yahoo', sourcename_0)
    sourcename_0 = reg_survey.sub('Survey', sourcename_0)
    if sourcename_0 not in ['Bing', 'GPS', 'Yahoo', 'Survey']:
        sourcename_0 = 'other'
    return sourcename_0
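
For illustration, the sketch below shows roughly how these update functions could be applied to the 'tag' children of a node or way during parsing. It assumes the functions and mapping_dict above have been run; it is a simplified stand-in for this step, not the project's actual shape_element() implementation.

In [ ]:
def clean_tag(tag_elem):
    # Dispatch to the relevant update function based on the tag key.
    key = tag_elem.attrib['k']
    value = tag_elem.attrib['v']
    if key == 'name':
        return key, mapping(value, mapping_dict)
    elif key == 'cuisine':
        return key, update_cuisine(value)
    elif key == 'phone':
        return key, update_phone(value)
    elif key == 'source':
        return key, update_source(value)
    return key, value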

3. Results after updating

The data cleaned by the update functions above is saved to the database:

  • Source: there are many different source values. I only normalized the main ones such as Bing, GPS, etc.; other sources such as 'local' and 'map.sogou.com' were also found, and those are classified as 'other'.
  • Phone: the phone formats here are more varied than I expected. I did not do much work on them, since little analysis can be done with phone numbers.
In [53]:
# distribution of source values after updating
import sqlite3
conn = sqlite3.connect('p3_project.db')
c = conn.cursor()
c.execute('select value,count(*) as num from ways_tags where key = "source" group by value order by num desc').fetchall()
Out[53]:
[('Bing', 3319), ('Yahoo', 1027), ('GPS', 371), ('other', 226), ('Survey', 35)]
In [55]:
# sample phone values after updating (formats are still mixed)
c.execute('select value as num from nodes_tags where key = "phone" limit 10').fetchall()
Out[55]:
[('+861065822892',),
 ('(010)64629112',),
 ('+8601051696505',),
 ('+86-10-60712288',),
 ('68716285;62555813',),
 ('+8613601135725/+861051357212',),
 ('+861051357212',),
 ('+861064428833',),
 ('+861063016688',),
 ('+86-10-64169999',)]

Data exploration from the SQL database

Size of files

In [63]:
print('beijing_china.osm:{} MB'.format(os.path.getsize('beijing_china.osm') >> 20))
print('p3_project.db:{} MB'.format(os.path.getsize('p3_project.db') >> 20))
print('ways.csv:{} MB'.format(os.path.getsize('ways.csv') >> 20))
print('ways_nodes.csv:{} MB'.format(os.path.getsize('ways_nodes.csv') >> 20))
print('ways_tags.csv:{} MB'.format(os.path.getsize('ways_tags.csv') >> 20))
print('nodes_tags.csv:{} MB'.format(os.path.getsize('nodes_tags.csv') >> 20))
print('nodes.csv:{} MB'.format(os.path.getsize('nodes.csv') >> 20))
beijing_china.osm:164 MB
p3_project.db:87 MB
ways.csv:6 MB
ways_nodes.csv:21 MB
ways_tags.csv:7 MB
nodes_tags.csv:2 MB
nodes.csv:60 MB

Top 10 contributing users of the Beijing map

In [51]:
# calculate the top 10 contributors across the nodes and ways of the Beijing map
sql_user = 'select distinct(user), count(*) as num\
            from (select id,user,uid from nodes\
            union \
            select id,user,uid from ways)\
            group by user\
            order by num desc\
            limit 10'
c.execute(sql_user).fetchall()
Out[51]:
[('Chen Jia', 198317),
 ('R438', 148126),
 ('hanchao', 66845),
 ('ij_', 52041),
 ('katpatuka', 23674),
 ('m17design', 21795),
 ('Esperanza36', 18399),
 ('nuklearerWintersturm', 17041),
 ('RationalTangle', 14089),
 ('Алекс Мок', 10621)]

Number of hutongs in Beijing

In [35]:
# total hutong numbers in Beijing
sql_user = ('select distinct(value) from nodes_tags where value like "%胡同"\
            union\
            select distinct(value) from ways_tags where value like "%胡同"')
len(c.execute(sql_user).fetchall())
Out[35]:
641

Source distribution

In [64]:
# source distribution
sql_user = 'select distinct(value),count(*) as num\
            from ways_tags\
            where key = "source"\
            group by value\
            order by num desc'
c.execute(sql_user).fetchall()
Out[64]:
[('Bing', 3319), ('Yahoo', 1027), ('GPS', 371), ('other', 226), ('Survey', 35)]

Problems encountered and solutions

1. How to deal with Chinese characters

It is unavoidable that Chinese characters appear in the Beijing map. The problem is that some names are written in Latin characters while others are in Chinese; for example, Hutong means 胡同. This case is typical, so I simply replace 'Hutong' with '胡同'. Translating pinyin into Chinese characters is tricky, however, so I did not translate the pinyin parts.

2. How to translate English into Chinese

I searched for this online and found that a Python library named goslate is a commonly used translation API, so I installed it with pip. However, I realized that goslate cannot translate pinyin into Chinese characters; it does nothing useful when it encounters strings written in pinyin mixed with English. In the end, I had to replace those words with the mapping function above.
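
For reference, a typical goslate call looks roughly like the sketch below; this is only an illustration of how the library is normally invoked, not code from the project, and it needs network access to Google's translation service.

In [ ]:
import goslate

gs = goslate.Goslate()
# Plain English translates as expected, but pinyin such as 'Rongxian Hutong'
# does not reliably come back as Chinese characters, hence the mapping() approach above.
print(gs.translate('Embassy of Germany', 'zh-CN'))
print(gs.translate('Rongxian Hutong', 'zh-CN'))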

3. Blank lines between CSV rows

The first time I parsed the OSM file and converted it into CSV files, I found a blank line between every two data rows when I opened the output in Excel. I googled the issue and found this answer.
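
The usual fix for this symptom on Python 3 is to open the output file with newline='' so the csv writer controls the line endings itself. A minimal sketch (the field names here are only illustrative, not the project's exact schema):

In [ ]:
import csv

# newline='' prevents the extra blank rows that Excel shows on Windows.
with open('nodes.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['id', 'lat', 'lon', 'user', 'uid'])
    writer.writeheader()
    writer.writerow({'id': 1, 'lat': 39.9, 'lon': 116.4, 'user': 'example', 'uid': 1})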

  • Element introduction: wiki

Other ideas about the data set

Suggestions

  1. Suggestions on improving the data set
    • Phone format should be standardized in advance: while reviewing this data set I noticed that the phone formats vary widely, and some of them are simply wrong. The format should be fixed as "+86 010 [8-digit phone number]"; a small validation sketch follows after this list.
    • Node tags: in this data set, node tags such as cuisine, name, and amenity are written in English, Chinese, or pinyin, which makes it hard to clean three different kinds of data. My suggestion is to show a hint while users enter the information, prompting them not to use pinyin.
  2. Suggestions on analysing this data set
    • A lot of information could be extracted from the map data, such as statistics on how many cuisines and hotels there are in Beijing; we could also try to visualize that information on the map.
    • Check the integrity of the map data against other map data sources.
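
As a rough illustration of what "standardized in advance" could mean, the snippet below checks numbers against the suggested '+86 010 XXXXXXXX' landline format; the exact pattern is my own assumption, not something OSM enforces.

In [ ]:
import re

# Assumed target format: '+86 010' followed by an 8-digit Beijing landline number.
PHONE_PATTERN = re.compile(r'^\+86 010 \d{8}$')

def is_valid_beijing_landline(phone):
    # True only for numbers already in the suggested standard format.
    return bool(PHONE_PATTERN.match(phone))

print(is_valid_beijing_landline('+86 010 69618888'))  # True
print(is_valid_beijing_landline('01087671788'))       # False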

Pros and cons

  1. Standardizing the phone format in advance:
    • Pros: most phone numbers would be in a uniform format.
    • Cons: none that I can see.
    • Expectation: phone formats would not be as dirty as they are in this data set.
  2. Node tags without pinyin:
    • Pros: no pinyin appears in the data.
    • Cons: some contributors may not know the correct Chinese characters, in which case they could enter wrong ones.
    • Expectation: English can easily be translated into Chinese, so this part would be easy to handle.
  3. Visualizing the information on the map:
    • Pros: more interesting insights can be found by visualizing the information (such as cuisines, schools, and hotels) on the map.
    • Cons: if the map is incomplete, the visualization will convey wrong information.

Conclusions

Through this project, I found that the main problem of the Beijing OpenStreetMap data is its inconsistency: a single way or node may be described in English, Chinese, or pinyin, which clearly increases the difficulty of the data cleaning process. Another problem is the phone format, which varies widely; since I am not particularly interested in phone numbers, I did not do much work on them.

Back to the whole process, the key steps are the following:

  • Parse the XML file and save the data into dictionaries; the data cleaning happens in this step, inside the function shape_element().
  • Write the dictionaries out to CSV files.
  • Load the CSV files into the SQLite database using pandas' to_sql() (see the sketch below).
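
A rough sketch of that last step is shown below; the table names follow the ones used in the queries above, while the if_exists and index arguments are my own choices rather than necessarily what the project used.

In [ ]:
import sqlite3
import pandas as pd

# Load each CSV produced in the previous step into the SQLite database.
conn = sqlite3.connect('p3_project.db')
for csv_file, table in [('nodes.csv', 'nodes'),
                        ('nodes_tags.csv', 'nodes_tags'),
                        ('ways.csv', 'ways'),
                        ('ways_nodes.csv', 'ways_nodes'),
                        ('ways_tags.csv', 'ways_tags')]:
    df = pd.read_csv(csv_file)
    df.to_sql(table, conn, if_exists='replace', index=False)
conn.close()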


Last but not least: have faith in yourself. I should have submitted this project a long time ago, but I thought it was too difficult for me to solve, so I moved on to other courses and left this one halfway. After finishing most of my Udacity courses, I came back to conquer it. Now I have solved the problems, cleaned the data from the XML file, and saved it to the database successfully. I can make it too!