Report on OpenStreetMap Data Cleaning Project

Contents

  • Map area
  • Data Overview
    • File size
    • Tag counts
  • Data cleaning process
    • Main problems in the data
    • Data cleaning functions
    • Results after updating
  • Data exploration from the SQL database
    • Size of files
    • Top 10 contributing users of the Beijing map
    • Number of hutongs in Beijing
    • Source distribution
  • Problems encountered and solutions
  • Other ideas about the data set
  • Conclusions

Map area

Beijing, China

Beijing is the capital of China, with many ancient buildings alongside well-designed modern architecture. I would like to explore some information about our great capital.
The link to the data file is: link

Data Overview

  • File size
    The size of the file is **164 megabytes**, which meets the size requirement of the Udacity course.
In [60]:
import os
print ('beijing_china.osm:{} MB'.format(os.path.getsize('beijing_china.osm') >> 20))
beijing_china.osm:164 MB
  • Tag counts
    Before cleaning the data, I wanted an overview of what I have. The first thing to check is how many of each tag type the file contains, which is counted by the function count_tags() below.
    Outputs:
    {'osm': 1, 'bounds': 1, 'node': 777694, 'tag': 339853, 'way': 115041, 'nd': 928128, 'relation': 5601, 'member': 60922}
In [ ]:
import xml.etree.ElementTree as ET

def count_tags(filename):
    # Count how many times each XML tag (node, way, tag, nd, ...) appears in the OSM file.
    tags = {}
    for _, elem in ET.iterparse(filename, events=('start',)):
        tags[elem.tag] = tags.get(elem.tag, 0) + 1
    return tags
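
Since the extract is around 164 MB, it can also help to clear each element once it has been counted, so that iterparse does not keep the whole tree in memory. A minimal sketch of that variant (same counts, lower memory use):

In [ ]:
import xml.etree.ElementTree as ET

def count_tags_lowmem(filename):
    # Same tag counting as above, but each element is cleared after it is
    # counted, so the parser does not accumulate the whole document in memory.
    tags = {}
    for _, elem in ET.iterparse(filename, events=('end',)):
        tags[elem.tag] = tags.get(elem.tag, 0) + 1
        elem.clear()
    return tags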

Data cleaning process

1. Main problems in the data

  • Way and node names: there are many way names; most are in Chinese, but some are in English or pinyin, such as **Rongxian Hutong** and **Embassy of Germany**. Those names need to be converted to Chinese characters for further exploration.
  • Phone format: some phone numbers are written as '01087671788', while others are written as '+86 010 69618888'.
  • Node source: sample node source values look like {'Bing', 'GPX', 'bing', 'gps', 'Bing, 2005-04'}.
  • Cuisine: sample cuisine values look like {'american;burger', 'chinese', 'german'}, which will be reduced to a single cuisine name.

2. Data cleaning functions

The functions used in this part are shown below. Names of nodes and ways written in pinyin are partially replaced with Chinese characters; for example, 'Rongxian Hutong' becomes 'Rongxian 胡同'. Fully translating pinyin into Chinese is tricky, so I left the rest alone; in practice I care more about the type of a node or way (such as '胡同' or '公路') than about its specific name.

In [1]:
# Functions for updating names, cuisine, phone and source values of ways and nodes
import re

mapping_dict = {
    'Road': '路',
    'Expressway': '高速公路',
    'Lu': '路',
    'lu': '路',
    'Hutong': '胡同',
    'hutong': '胡同',
    'Embassy': '大使馆',
    'Coffee': '咖啡厅'
}

def mapping(name, mapping_dict):
    # Replace the last English/pinyin word of a name with its Chinese equivalent,
    # e.g. 'Rongxian Hutong' -> 'Rongxian 胡同'; otherwise return the name unchanged.
    name_list = name.split()
    if name_list[-1] in mapping_dict:
        name_list[-1] = mapping_dict[name_list[-1]]
        return ' '.join(name_list)
    return name

def update_cuisine(cuisine):
    # Keep only the first cuisine when several are listed, e.g. 'american;burger' -> 'american'.
    return cuisine.split(';')[0]

def update_phone(phone):
    # Normalize phone numbers to a '+86...' form where possible.
    if len(phone) == 8:            # local Beijing number: add country and area code
        return '+86010' + phone
    elif len(phone) == 11:         # mobile number: add country code
        return '+86' + phone
    elif len(phone) > 11:          # already has a prefix: just strip spaces
        return phone.replace(' ', '')
    else:                          # too short or an odd length: flag as an error
        return 'Error' + '+' + phone

def update_source(source):
    # Keep only the first token of the source string and normalize its spelling,
    # e.g. 'bing, 2005-04' -> 'Bing'; anything unrecognized becomes 'other'.
    sourcename_0 = re.split(r'[,;. ]', source)[0]
    reg_bing = re.compile(re.escape('bing'), re.IGNORECASE)
    reg_gps = re.compile(re.escape('gps'), re.IGNORECASE)
    reg_yahoo = re.compile(re.escape('yahoo'), re.IGNORECASE)
    reg_survey = re.compile(re.escape('survey'), re.IGNORECASE)
    sourcename_0 = reg_bing.sub('Bing', sourcename_0)
    sourcename_0 = reg_gps.sub('GPS', sourcename_0)
    sourcename_0 = reg_yahoo.sub('Yahoo', sourcename_0)
    sourcename_0 = reg_survey.sub('Survey', sourcename_0)
    if sourcename_0 not in ['Bing', 'GPS', 'Yahoo', 'Survey']:
        sourcename_0 = 'other'
    return sourcename_0
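
For illustration, the sketch below shows roughly how these update functions could be applied to the 'tag' children of a node or way during parsing. It assumes the functions and mapping_dict above have been run; it is a simplified stand-in for this step, not the project's actual shape_element() implementation.

In [ ]:
def clean_tag(tag_elem):
    # Dispatch to the relevant update function based on the tag key.
    key = tag_elem.attrib['k']
    value = tag_elem.attrib['v']
    if key == 'name':
        return key, mapping(value, mapping_dict)
    elif key == 'cuisine':
        return key, update_cuisine(value)
    elif key == 'phone':
        return key, update_phone(value)
    elif key == 'source':
        return key, update_source(value)
    return key, value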

3. Results after updating

The data cleaned by the update functions above is saved to the database:

  • Source: there are many different source values. I only normalized the main ones such as Bing, GPS, etc.; other sources such as 'local' and 'map.sogou.com' were also found, and those are classified as 'other'.
  • Phone: the phone formats here are more varied than I expected. I did not do much work on them, since little analysis can be done with phone numbers.
In [53]:
# distribution of source values after updating
import sqlite3
conn = sqlite3.connect('p3_project.db')
c = conn.cursor()
c.execute('select value,count(*) as num from ways_tags where key = "source" group by value order by num desc').fetchall()
Out[53]:
[('Bing', 3319), ('Yahoo', 1027), ('GPS', 371), ('other', 226), ('Survey', 35)]
In [55]:
# sample phone values after updating (formats are still mixed)
c.execute('select value as num from nodes_tags where key = "phone" limit 10').fetchall()
Out[55]:
[('+861065822892',),
 ('(010)64629112',),
 ('+8601051696505',),
 ('+86-10-60712288',),
 ('68716285;62555813',),
 ('+8613601135725/+861051357212',),
 ('+861051357212',),
 ('+861064428833',),
 ('+861063016688',),
 ('+86-10-64169999',)]

Data exploration from the SQL database

Size of files

In [63]:
print('beijing_china.osm:{} MB'.format(os.path.getsize('beijing_china.osm') >> 20))
print('p3_project.db:{} MB'.format(os.path.getsize('p3_project.db') >> 20))
print('ways.csv:{} MB'.format(os.path.getsize('ways.csv') >> 20))
print('ways_nodes.csv:{} MB'.format(os.path.getsize('ways_nodes.csv') >> 20))
print('ways_tags.csv:{} MB'.format(os.path.getsize('ways_tags.csv') >> 20))
print('nodes_tags.csv:{} MB'.format(os.path.getsize('nodes_tags.csv') >> 20))
print('nodes.csv:{} MB'.format(os.path.getsize('nodes.csv') >> 20))
beijing_china.osm:164 MB
p3_project.db:87 MB
ways.csv:6 MB
ways_nodes.csv:21 MB
ways_tags.csv:7 MB
nodes_tags.csv:2 MB
nodes.csv:60 MB

Top 10 contributing users of the Beijing map

In [51]:
# calculate the top 10 contributors across the nodes and ways of the Beijing map
sql_user = 'select distinct(user), count(*) as num\
            from (select id,user,uid from nodes\
            union \
            select id,user,uid from ways)\
            group by user\
            order by num desc\
            limit 10'
c.execute(sql_user).fetchall()
Out[51]:
[('Chen Jia', 198317),
 ('R438', 148126),
 ('hanchao', 66845),
 ('ij_', 52041),
 ('katpatuka', 23674),
 ('m17design', 21795),
 ('Esperanza36', 18399),
 ('nuklearerWintersturm', 17041),
 ('RationalTangle', 14089),
 ('Алекс Мок', 10621)]

Number of hutongs in Beijing

In [35]:
# total hutong numbers in Beijing
sql_user = ('select distinct(value) from nodes_tags where value like "%胡同"\
            union\
            select distinct(value) from ways_tags where value like "%胡同"')
len(c.execute(sql_user).fetchall())
Out[35]:
641

Source distribution

In [64]:
# source distribution
sql_user = 'select distinct(value),count(*) as num\
            from ways_tags\
            where key = "source"\
            group by value\
            order by num desc'
c.execute(sql_user).fetchall()
Out[64]:
[('Bing', 3319), ('Yahoo', 1027), ('GPS', 371), ('other', 226), ('Survey', 35)]

Problems encountered and solutions

1. How to deal with Chinese characters

It is unavoidable that Chinese characters appear in the Beijing map. The problem is that some names are written in Latin characters while others are in Chinese; for example, Hutong means 胡同. This case is typical, so I simply replace 'Hutong' with '胡同'. Translating pinyin into Chinese characters is tricky, however, so I did not translate the pinyin parts.

2. How to translate English into Chinese

I searched for this online and found that a Python library named goslate is a commonly used translation API, so I installed it with pip. However, I realized that goslate cannot translate pinyin into Chinese characters; it does nothing useful when it encounters strings written in pinyin mixed with English. In the end, I had to replace those words with the mapping function above.
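
For reference, a typical goslate call looks roughly like the sketch below; this is only an illustration of how the library is normally invoked, not code from the project, and it needs network access to Google's translation service.

In [ ]:
import goslate

gs = goslate.Goslate()
# Plain English translates as expected, but pinyin such as 'Rongxian Hutong'
# does not reliably come back as Chinese characters, hence the mapping() approach above.
print(gs.translate('Embassy of Germany', 'zh-CN'))
print(gs.translate('Rongxian Hutong', 'zh-CN'))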

3. Blank lines between CSV rows

The first time I parsed the OSM file and converted it into CSV files, I found a blank line between every two data rows when I opened the output in Excel. I googled the issue and found this answer.
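
The usual fix for this symptom on Python 3 is to open the output file with newline='' so the csv writer controls the line endings itself. A minimal sketch (the field names here are only illustrative, not the project's exact schema):

In [ ]:
import csv

# newline='' prevents the extra blank rows that Excel shows on Windows.
with open('nodes.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['id', 'lat', 'lon', 'user', 'uid'])
    writer.writeheader()
    writer.writerow({'id': 1, 'lat': 39.9, 'lon': 116.4, 'user': 'example', 'uid': 1})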

  • Element introduction: wiki

Other ideas about the data set

Suggestions

  1. Suggestions on improving the data set
    • Phone format should be standardized in advance: while reviewing this data set I noticed that the phone formats vary widely, and some of them are simply wrong. The format should be fixed as "+86 010 [8-digit phone number]"; a small validation sketch follows after this list.
    • Node tags: in this data set, node tags such as cuisine, name, and amenity are written in English, Chinese, or pinyin, which makes it hard to clean three different kinds of data. My suggestion is to show a hint while users enter the information, prompting them not to use pinyin.
  2. Suggestions on analysing this data set
    • A lot of information could be extracted from the map data, such as statistics on how many cuisines and hotels there are in Beijing; we could also try to visualize that information on the map.
    • Check the integrity of the map data against other map data sources.
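
As a rough illustration of what "standardized in advance" could mean, the snippet below checks numbers against the suggested '+86 010 XXXXXXXX' landline format; the exact pattern is my own assumption, not something OSM enforces.

In [ ]:
import re

# Assumed target format: '+86 010' followed by an 8-digit Beijing landline number.
PHONE_PATTERN = re.compile(r'^\+86 010 \d{8}$')

def is_valid_beijing_landline(phone):
    # True only for numbers already in the suggested standard format.
    return bool(PHONE_PATTERN.match(phone))

print(is_valid_beijing_landline('+86 010 69618888'))  # True
print(is_valid_beijing_landline('01087671788'))       # False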

Pros and cons

  1. Standardizing the phone format in advance:
    • Pros: most phone numbers would be in a uniform format.
    • Cons: none that I can see.
    • Expectation: phone formats would not be as dirty as they are in this data set.
  2. Node tags without pinyin:
    • Pros: no pinyin appears in the data.
    • Cons: some contributors may not know the correct Chinese characters, in which case they could enter wrong ones.
    • Expectation: English can easily be translated into Chinese, so this part would be easy to handle.
  3. Visualizing the information on the map:
    • Pros: more interesting insights can be found by visualizing the information (such as cuisines, schools, and hotels) on the map.
    • Cons: if the map is incomplete, the visualization will convey wrong information.

Conclusions

Through this project, I found that the main problem of the Beijing OpenStreetMap data is its inconsistency: a single way or node may be described in English, Chinese, or pinyin, which clearly increases the difficulty of the data cleaning process. Another problem is the phone format, which varies widely; since I am not particularly interested in phone numbers, I did not do much work on them.

Back to the whole process, the key steps are the following:

  • Parse the XML file and save the data into dictionaries; the data cleaning happens in this step, inside the function shape_element().
  • Write the dictionaries out to CSV files.
  • Load the CSV files into the SQLite database using pandas' to_sql() (see the sketch below).
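
A rough sketch of that last step is shown below; the table names follow the ones used in the queries above, while the if_exists and index arguments are my own choices rather than necessarily what the project used.

In [ ]:
import sqlite3
import pandas as pd

# Load each CSV produced in the previous step into the SQLite database.
conn = sqlite3.connect('p3_project.db')
for csv_file, table in [('nodes.csv', 'nodes'),
                        ('nodes_tags.csv', 'nodes_tags'),
                        ('ways.csv', 'ways'),
                        ('ways_nodes.csv', 'ways_nodes'),
                        ('ways_tags.csv', 'ways_tags')]:
    df = pd.read_csv(csv_file)
    df.to_sql(table, conn, if_exists='replace', index=False)
conn.close()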


Last but not least: have faith in yourself. I should have submitted this project a long time ago, but I thought it was too difficult for me to solve, so I moved on to other courses and left this one halfway. After finishing most of my Udacity courses, I came back to conquer it. Now I have solved the problems, cleaned the data from the XML file, and saved it to the database successfully. I can make it too!