Mercurial > hg > GlobalNeighbors
comparison README.txt @ 0:5dba84370182
initial commit; half-working prototype
| author | Jeff Hammel <k0scist@gmail.com> |
|---|---|
| date | Sat, 24 Jun 2017 12:03:39 -0700 |
| parents | |
| children | d1b99c695511 |
comparison
equal
deleted
inserted
replaced
| -1:000000000000 | 0:5dba84370182 |
|---|---|
| 1 # GlobalNeighbors | |
| 2 | |
| 3 locate nearest cities | |
| 4 | |
| 5 ## Making the World Local | |
| 6 | |
| 7 ## Dataset | |
| 8 | |
| 9 The data is taken from http://download.geonames.org/export/dump/cities1000.zip , | |
| 10 all cities with a population > 1000 or seats of adm div (ca 150.000), | |
| 11 according to http://download.geonames.org/export/dump/ . | |
| 12 `wc -l` indicates that the text file has 149092 records. | |
| 13 | |
| 14 The schema is given by: | |
| 15 | |
| 16 geonameid : integer id of record in geonames database | |
| 17 name : name of geographical point (utf8) varchar(200) | |
| 18 asciiname : name of geographical point in plain ascii characters, varchar(200) | |
| 19 alternatenames : alternatenames, comma separated, ascii names automatically transliterated, convenience attribute from alternatename table, varchar(10000) | |
| 20 latitude : latitude in decimal degrees (wgs84) | |
| 21 longitude : longitude in decimal degrees (wgs84) | |
| 22 feature class : see http://www.geonames.org/export/codes.html, char(1) | |
| 23 feature code : see http://www.geonames.org/export/codes.html, varchar(10) | |
| 24 country code : ISO-3166 2-letter country code, 2 characters | |
| 25 cc2 : alternate country codes, comma separated, ISO-3166 2-letter country code, 200 characters | |
| 26 admin1 code : fipscode (subject to change to iso code), see exceptions below, see file admin1Codes.txt for display names of this code; varchar(20) | |
| 27 admin2 code : code for the second administrative division, a county in the US, see file admin2Codes.txt; varchar(80) | |
| 28 admin3 code : code for third level administrative division, varchar(20) | |
| 29 admin4 code : code for fourth level administrative division, varchar(20) | |
| 30 population : bigint (8 byte int) | |
| 31 elevation : in meters, integer | |
| 32 dem : digital elevation model, srtm3 or gtopo30, average elevation of 3''x3'' (ca 90mx90m) or 30''x30'' (ca 900mx900m) area in meters, integer. srtm processed by cgiar/ciat. | |
| 33 timezone : the iana timezone id (see file timeZone.txt) varchar(40) | |
| 34 modification date : date of last modification in yyyy-MM-dd format | |
| 35 | |
| 36 (Ref: http://download.geonames.org/export/dump/) | |
| 37 | |
| 38 | |
| 39 ## Distance Calculation | |
| 40 | |
| 41 A spherical projection is assumed for the Earth. | |
| 42 | |
| 43 In order to calculated the distance between cities, | |
| 44 the Haversine formula is used (see | |
| 45 https://en.wikipedia.org/wiki/Haversine_formula ). | |
| 46 | |
| 47 The naive approach, which we take here is to calculate | |
| 48 all the distances between all cities. This is notably | |
| 49 an `N^2` algorithm (or technically I believe `0.5*N*(N-1)`) | |
| 50 | |
| 51 In this case roughly: | |
| 52 ``` | |
| 53 O = 0.5*149092*149092 approx 10 billion | |
| 54 ``` | |
| 55 | |
| 56 This is no small task even with modern computers. CPU, Memory, | |
| 57 and Storage are non negligible even for this modern data set. | |
| 58 | |
| 59 So first we filter down based on boxing on latitude and longitude. | |
| 60 This is not 100% correct as it ignores the singularity at the poles | |
| 61 but this is probably tolerable since there are few cities of magnitude | |
| 62 in these regions. | |
| 63 | |
| 64 So performance is important. | |
| 65 | |
| 66 Just iterating over the `0.5*N*(N-1)` without any calculations | |
| 67 is expensive in python: just iterating over the items | |
| 68 is about 30 minutes on this computer (in series). | |
| 69 Keeping track of neighbors within +/- 1 degree latitude and | |
| 70 longitude takes closer to an hour. Boxing this way gets | |
| 71 a maximum number of neighbors of 2376 and a minimum value of 0 | |
| 72 with a mean of 222. This is stored in `neighbors.json` in `data`. | |
| 73 There are 72897 cases with less than 100 in this box and | |
| 74 18444 with less than 10. | |
| 75 | |
| 76 | |
| 77 ## Web Service | |
| 78 | |
| 79 A simple web application was developed using | |
| 80 `gunicorn` (http://gunicorn.org/) as the server. | |
| 81 | |
| 82 ## Deployment notes | |
| 83 | |
| 84 - parallelism: the distance calculation is done serially. As such | |
| 85 it is a `O(10^10)` operation on the dataset. This should be improved | |
| 86 and parallelized | |
| 87 | |
| 88 | |
| 89 ## (Hopefully) Helpful Links | |
| 90 | |
| 91 - http://geojsonlint.com/ | |
| 92 - https://en.wikipedia.org/wiki/Haversine_formula | |
| 93 | |
| 94 ---- | |
| 95 | |
| 96 Jeff Hammel |
